distributed-fine-tuning-hugging-face(Python)


Distributed fine-tuning with the Hugging Face Transformers API

This notebook provides an example of how to fine-tune a Hugging Face model using the Transformers API and the TorchDistributor API. The fine-tuning guidance in this notebook is based on this Hugging Face blog post.

Requirements

  • Databricks Runtime ML 13.0 and above
  • Multi-node GPU cluster
  • (Recommended) GPU instances

Define the number of GPUs to use

In this example you use a cluster with 4 worker nodes. If you are using a different cluster configuration, update NUM_WORKERS accordingly.
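A minimal sketch of this setup cell. NUM_WORKERS and USE_GPU are assumed variable names (USE_GPU is reused in the distributor calls below), and the GPU probe here only checks the driver; if your workers differ from the driver, you could run the same check on them through Spark:

import torch

# Number of worker nodes in this example cluster; update for your configuration.
NUM_WORKERS = 4

# Check whether a GPU is visible from the driver.
USE_GPU = torch.cuda.is_available()
print(f"Using GPU: {USE_GPU}")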


Using GPU: True

Preprocess your data

Initialize the tokenizer and collator for preprocessing the data.
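A minimal sketch, assuming DistilBERT as the base model (matching the checkpoint name that appears in the outputs below) and dynamic padding via a collator:

from transformers import AutoTokenizer, DataCollatorWithPadding

base_model = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(base_model)
# Pad each batch dynamically to the length of its longest sequence.
collator = DataCollatorWithPadding(tokenizer=tokenizer)

def tokenize_function(examples):
    # Truncate reviews that exceed the model's maximum input length.
    return tokenizer(examples["text"], truncation=True)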



Import and preprocess the IMDB dataset

One key difference between the Hugging Face blog post and this notebook is that this example uses all of the IMDB data, not just 3,000 data points.
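A minimal sketch using the datasets library and the tokenize_function defined above. The evaluation outputs below suggest the run evaluated on a subset of the test split; the sketch keeps the full splits for simplicity:

from datasets import load_dataset

# Load the full IMDB dataset (25,000 train and 25,000 test examples).
imdb = load_dataset("imdb")

# Tokenize both splits up front so the training function only consumes tensors.
train_dataset = imdb["train"].map(tokenize_function, batched=True)
test_dataset = imdb["test"].map(tokenize_function, batched=True)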



Set up the training function

The TorchDistributor API supports single-node multi-GPU training as well as multi-node training.

When you wrap the single-node code in the train() function, Databricks recommends including all of the import statements inside the train() function to avoid library pickling issues. train_model() can return any picklable object, but it cannot return the Trainer itself, because a Trainer is not picklable without an active process group. Instead, return the path of the best checkpoint and use it externally, as shown in the sketch below.
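A minimal sketch of such a training function, following that advice. The hyperparameters and output path are illustrative rather than the original cell's values, and base_model, train_dataset, test_dataset, and collator are the assumed names from the sketches above:

def train_model():
    # Keep every import inside the function so it pickles cleanly when
    # shipped to other processes.
    import numpy as np
    import evaluate
    from transformers import (
        AutoModelForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    load_accuracy = evaluate.load("accuracy")
    load_f1 = evaluate.load("f1")

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
        f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
        return {"accuracy": accuracy, "f1": f1}

    model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

    training_args = TrainingArguments(
        output_dir="/dbfs/tmp/imdb-finetune",  # hypothetical persistent path
        per_device_train_batch_size=16,
        per_device_eval_batch_size=64,
        evaluation_strategy="epoch",
        num_train_epochs=2,
        report_to="none",  # keep the sketch self-contained
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        data_collator=collator,
        compute_metrics=compute_metrics,
    )
    trainer.train()

    # Return a picklable result: the checkpoint path, not the Trainer.
    final_checkpoint = training_args.output_dir + "/final"
    trainer.save_model(final_checkpoint)
    return final_checkpoint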


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Run local training
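Before distributing anything, you can run the function directly on the driver as a single-process baseline; a sketch using the names above:

# Single-process baseline: call the training function directly on the driver.
single_node_ckpt = train_model()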


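A sketch of how the returned checkpoint might then be evaluated externally. The metric helper mirrors the one inside train_model(), and single_node_ckpt is the assumed name of the returned path:

import numpy as np
import evaluate
from transformers import AutoModelForSequenceClassification, Trainer

load_accuracy = evaluate.load("accuracy")
load_f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    # Same metric computation as inside train_model().
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}

# Reload the fine-tuned weights from the saved checkpoint.
finetuned_model = AutoModelForSequenceClassification.from_pretrained(single_node_ckpt)

evaluator = Trainer(
    model=finetuned_model,
    eval_dataset=test_dataset,
    data_collator=collator,
    compute_metrics=compute_metrics,
)
evaluator.evaluate()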

{'eval_loss': 0.23018266260623932, 'eval_model_preparation_time': 0.0015, 'eval_accuracy': 0.92, 'eval_f1': 0.9218444704962876, 'eval_runtime': 27.4925, 'eval_samples_per_second': 181.868, 'eval_steps_per_second': 22.733}

Run distributed training on a single node with multiple GPUs

The TorchDistributor with local_mode=True runs the train() function directly on the driver node of the Spark cluster.

To configure the total number of GPUs for this run, pass num_processes=N to the TorchDistributor, where N is the number of GPUs you want to use on the driver node. Note that you don't need to make any changes to your training code; a sketch follows.
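A sketch, assuming the driver node has the GPUs and reusing USE_GPU from the setup cell:

import torch
from pyspark.ml.torch.distributor import TorchDistributor

NUM_PROCESSES = torch.cuda.device_count()
print(f"We're using {NUM_PROCESSES} GPUs")

# local_mode=True keeps all processes on the driver node.
local_mode_ckpt = TorchDistributor(
    num_processes=NUM_PROCESSES,
    local_mode=True,
    use_gpu=USE_GPU,
).run(train_model)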


We're using 4 GPUs
INFO:TorchDistributor:Started local training with 4 processes
{'eval_loss': 0.2115330994129181, 'eval_accuracy': 0.917, 'eval_f1': 0.9166164356037774, 'eval_runtime': 20.0585, 'eval_samples_per_second': 249.271, 'eval_steps_per_second': 3.938, 'epoch': 1.0}
{'loss': 0.2468, 'grad_norm': 2.072420597076416, 'learning_rate': 4.02555910543131e-06, 'epoch': 1.6}
{'eval_loss': 0.20669345557689667, 'eval_accuracy': 0.9214, 'eval_f1': 0.9211001806866091, 'eval_runtime': 19.2586, 'eval_samples_per_second': 259.625, 'eval_steps_per_second': 4.102, 'epoch': 2.0}
{'train_runtime': 505.0831, 'train_samples_per_second': 79.195, 'train_steps_per_second': 1.239, 'train_loss': 0.23027655263297475, 'epoch': 2.0}
INFO:TorchDistributor:Finished local training with 4 processes

{'eval_loss': 0.20622585713863373, 'eval_model_preparation_time': 0.0015, 'eval_accuracy': 0.9214, 'eval_f1': 0.9211001806866091, 'eval_runtime': 133.9888, 'eval_samples_per_second': 37.317, 'eval_steps_per_second': 4.665}

Run distributed training on multiple nodes

The TorchDistributor with local_mode=False (the default) runs the train() function on the worker nodes of the Spark cluster.

To configure the total number of GPUs for this run, pass num_processes=TOTAL_NUM_GPUS to the TorchDistributor. Again, you don't need to make any changes to your training code; a sketch follows.
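A sketch, assuming one GPU per worker node so the total process count equals NUM_WORKERS:

from pyspark.ml.torch.distributor import TorchDistributor

TOTAL_NUM_GPUS = NUM_WORKERS  # assumes 1 GPU per worker node
print(f"We're using {TOTAL_NUM_GPUS} GPUs")

# local_mode=False (the default) spreads the processes across the workers.
multi_node_ckpt = TorchDistributor(
    num_processes=TOTAL_NUM_GPUS,
    local_mode=False,
    use_gpu=USE_GPU,
).run(train_model)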


We're using 4 GPUs
INFO:TorchDistributor:Started distributed training with 4 executor processes
{'eval_loss': 0.211540088057518, 'eval_accuracy': 0.9172, 'eval_f1': 0.9168006430868167, 'eval_runtime': 20.042, 'eval_samples_per_second': 249.476, 'eval_steps_per_second': 3.942, 'epoch': 1.0}
[00:12<00:06, 4.16it/s] 66%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Œ | 52/79 [00:12<00:06, 4.13it/s] 67%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‹ | 53/79 [00:12<00:06, 4.13it/s] 68%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š | 54/79 [00:12<00:06, 4.14it/s] 70%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‰ | 55/79 [00:13<00:05, 4.15it/s] 71%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ | 56/79 [00:13<00:05, 4.20it/s] 72%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ– | 57/79 [00:13<00:05, 4.20it/s] 73%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Ž | 58/79 [00:13<00:05, 4.18it/s] 75%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ– | 59/79 [00:13<00:04, 4.11it/s] 76%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Œ | 60/79 [00:14<00:04, 4.13it/s] 77%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‹ | 61/79 [00:14<00:04, 4.16it/s] 78%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š | 62/79 [00:14<00:04, 4.17it/s] 80%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‰ | 63/79 [00:14<00:03, 4.12it/s] 81%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ | 64/79 [00:15<00:03, 4.13it/s] 82%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ– | 65/79 [00:15<00:03, 4.13it/s] 84%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Ž | 66/79 [00:15<00:03, 4.14it/s] 85%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ– | 67/79 [00:15<00:02, 4.11it/s] 86%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Œ | 68/79 [00:16<00:02, 4.10it/s] 87%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‹ | 69/79 [00:16<00:02, 4.15it/s] 89%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š | 70/79 [00:16<00:02, 4.16it/s] 90%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‰ | 71/79 [00:16<00:01, 4.16it/s] 91%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ | 72/79 [00:17<00:01, 4.18it/s] 92%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–| 73/79 [00:17<00:01, 4.21it/s] 94%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Ž| 74/79 [00:17<00:01, 4.22it/s] 95%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–| 75/79 [00:17<00:00, 4.23it/s] 96%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Œ| 76/79 [00:18<00:00, 4.23it/s] 97%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‹| 77/79 [00:18<00:00, 4.24it/s] 99%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š| 78/79 [00:18<00:00, 4.18it/s] 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 79/79 [00:18<00:00, 4.15it/s] {'eval_loss': 0.20667248964309692, 'eval_accuracy': 0.9214, 'eval_f1': 0.9210684876481221, 'eval_runtime': 19.6474, 'eval_samples_per_second': 254.486, 'eval_steps_per_second': 4.021, 'epoch': 2.0} 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 626/626 [09:25<00:00, 1.21it/s] 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 79/79 [00:19<00:00, 4.15it/s] {'train_runtime': 573.3609, 'train_samples_per_second': 69.764, 'train_steps_per_second': 1.092, 'train_loss': 0.2302702273042819, 'epoch': 2.0} 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 626/626 [09:33<00:00, 1.09it/s] [rank0]:[W129 23:47:39.492236604 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) [rank1]:[W129 23:47:39.486471750 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) [rank3]:[W129 23:47:39.210770637 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. 
On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) [rank2]:[W129 23:47:39.413852168 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) INFO:TorchDistributor:Finished distributed training with 4 executor processes
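For reference, the output above comes from launching the training function with TorchDistributor. The code cell itself is collapsed in this export, so the following is only a minimal sketch; the names `NUM_WORKERS` and `train_fn` are assumed to come from the earlier cells.

```python
from pyspark.ml.torch.distributor import TorchDistributor

# NUM_WORKERS and train_fn are assumed to be defined in earlier cells
# (the GPU-count cell and the training-function cell, respectively).
distributor = TorchDistributor(
    num_processes=NUM_WORKERS,  # one training process per GPU worker
    local_mode=False,           # run on the cluster's workers, not the driver
    use_gpu=True,
)

# run() blocks until all executor processes finish and returns whatever
# train_fn returns, such as a path to the trained checkpoint.
output = distributor.run(train_fn)
```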
        18

          2025/01/29 23:49:58 INFO mlflow.tracking._tracking_service.client: ๐Ÿƒ View run tmp_trainer at: db-sme-demo-docs.cloud.databricks.com/ml/experiments/1126725452244120/runs/d27916bd85774058994538edc83f36a5. 2025/01/29 23:49:58 INFO mlflow.tracking._tracking_service.client: ๐Ÿงช View experiment at: db-sme-demo-docs.cloud.databricks.com/ml/experiments/1126725452244120.
          {'eval_loss': 0.20620520412921906, 'eval_model_preparation_time': 0.0013, 'eval_accuracy': 0.9214, 'eval_f1': 0.9210684876481221, 'eval_runtime': 130.551, 'eval_samples_per_second': 38.299, 'eval_steps_per_second': 4.787}
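The metrics above come from re-evaluating the fine-tuned model on the driver after distributed training completes. Since the code cell is hidden, here is a minimal sketch of what such an evaluation step might look like; `model`, `tokenizer`, `test_dataset`, and `compute_metrics` are all assumed to be defined in earlier cells.

```python
from transformers import Trainer, TrainingArguments

# All names below (model, tokenizer, test_dataset, compute_metrics) are
# assumed from earlier cells; this sketch is not the hidden cell verbatim.
eval_args = TrainingArguments(
    output_dir="/tmp/eval",          # scratch directory; nothing trains here
    per_device_eval_batch_size=8,
    report_to=[],                    # skip extra experiment logging
)
trainer = Trainer(
    model=model,
    args=eval_args,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
print(trainer.evaluate())
```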

          Test the model with the Transformers pipeline API

          20

          Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
          [{'i love this movie': 'Positive'}, {'this movie sucks!': 'Negative'}]
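A minimal sketch of the kind of cell that produces this output. The save path is hypothetical, and the model config's `id2label` mapping to 'Negative'/'Positive' is assumed to have been set during fine-tuning; neither is shown in this export.

```python
from transformers import pipeline

# Hypothetical save location; the notebook's actual output path is not shown.
model_path = "/dbfs/distributed-fine-tuning/imdb-model"

# No `device` argument is passed, which triggers the CPU warning shown above.
classifier = pipeline("text-classification", model=model_path, tokenizer=model_path)

samples = ["i love this movie", "this movie sucks!"]
preds = classifier(samples)

# Reshape the raw predictions into the {text: label} form shown above,
# assuming id2label maps class ids to 'Positive'/'Negative'.
print([{text: pred["label"]} for text, pred in zip(samples, preds)])
```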