torch-distributor-notebook(Python)

End-to-end distributed training in a Databricks notebook

Distributed training with PyTorch is typically done by putting the training code in a file (for example, train.py) and launching it with the torchrun CLI. Databricks offers a way to run distributed training directly from a Databricks notebook: you define the train() function within a notebook and use the TorchDistributor API to train the model across the workers.

This notebook illustrates how to develop interactively within a notebook. For larger deep learning projects in particular, Databricks recommends using the %run command to split your code into manageable chunks.

In this notebook, you:

  • Train a simple single-GPU model on the classic MNIST dataset
  • Adapt that code for distributed training
  • Learn how to use TorchDistributor to scale model training across multiple GPUs or multiple nodes

Requirements

  • Databricks Runtime ML 13.0 and above
  • Run this notebook on a cluster with Single User access mode. If the cluster needs to be shared with other team members, contact your Databricks account team for solutions.
  • (Recommended) GPU instances AWS | Azure | GCP

MLflow setup

MLflow is a tool for tracking machine learning experiments and logging models. The db_host variable controls the MLflow tracking server and must be set to the URL of the workspace.

NOTE: The MLflow PyTorch autologging APIs are designed for PyTorch Lightning and do not work with native PyTorch.
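
For illustration, a minimal sketch of what the MLflow setup might look like; the db_host value, experiment_path, and the way the token is obtained are assumptions to replace with values for your own workspace:

import os
import mlflow

# Assumed values -- replace with your workspace URL and experiment path.
db_host = "https://<your-workspace-url>"
db_token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
experiment_path = "/Users/<your-username>/torch-distributor-mnist"

# The training function later runs in separate processes, so expose the host
# and token as environment variables that MLflow can read in each process.
os.environ["DATABRICKS_HOST"] = db_host
os.environ["DATABRICKS_TOKEN"] = db_token

mlflow.set_experiment(experiment_path)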

Define train and test functions

The following cell contains code that defines the model, the training function, and the testing function, all of which are designed to run locally. Later cells introduce the changes needed to move training from the local setting to a distributed setting.

All the torch code uses standard PyTorch APIs; no custom libraries or changes to the way the code is written are required. This notebook focuses on how to scale your training with TorchDistributor and does not walk through the model code.
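
For context, a minimal sketch of what these single-GPU pieces might look like; Net, train_one_epoch, and test are illustrative names and not necessarily the exact code in the cell:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    # Small CNN for 28x28 MNIST images.
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)

def train_one_epoch(model, device, data_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(data_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        loss = F.nll_loss(model(data), target)
        loss.backward()
        optimizer.step()
        if batch_idx % 100 == 0:
            print(f"Train Epoch: {epoch} [{batch_idx * len(data)}/{len(data_loader.dataset)}] "
                  f"Loss: {loss.item():.6f}")

def test(model, device, data_loader):
    model.eval()
    total_loss = 0.0
    with torch.no_grad():
        for data, target in data_loader:
            data, target = data.to(device), target.to(device)
            total_loss += F.nll_loss(model(data), target, reduction="sum").item()
    avg_loss = total_loss / len(data_loader.dataset)
    print(f"Average test loss: {avg_loss}")
    return avg_loss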

Log directory: /dbfs/ml/pytorch/1733957302.7722206

Train the model locally

To verify that everything runs correctly, you can trigger a training and testing iteration using the functions defined above.
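
A sketch of how such a local run could be driven, reusing the Net, train_one_epoch, and test functions sketched above; the hyperparameters and data path are assumptions for illustration:

import mlflow
import mlflow.pytorch
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

NUM_EPOCHS = 3        # assumed hyperparameters
BATCH_SIZE = 100
LEARNING_RATE = 0.001

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
train_loader = DataLoader(datasets.MNIST("/tmp/data", train=True, download=True, transform=transform),
                          batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(datasets.MNIST("/tmp/data", train=False, transform=transform),
                         batch_size=BATCH_SIZE)

model = Net().to(device)
optimizer = torch.optim.Adadelta(model.parameters(), lr=LEARNING_RATE)

with mlflow.start_run():
    for epoch in range(1, NUM_EPOCHS + 1):
        train_one_epoch(model, device, train_loader, optimizer, epoch)
    test(model, device, test_loader)
    mlflow.pytorch.log_model(model, "model")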

/root/.ipykernel/4182/command-2280852430858390-1170614654:30: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
  return F.log_softmax(x)
Train Epoch: 1 [0/60000 (0%)]  Loss: 2.324770
Train Epoch: 1 [10000/60000 (17%)]  Loss: 2.297613
Train Epoch: 1 [20000/60000 (33%)]  Loss: 2.277661
Train Epoch: 1 [30000/60000 (50%)]  Loss: 2.290999
Train Epoch: 1 [40000/60000 (67%)]  Loss: 2.275110
Train Epoch: 1 [50000/60000 (83%)]  Loss: 2.220335
Train Epoch: 2 [0/60000 (0%)]  Loss: 2.224195
Train Epoch: 2 [10000/60000 (17%)]  Loss: 2.176879
Train Epoch: 2 [20000/60000 (33%)]  Loss: 2.169643
Train Epoch: 2 [30000/60000 (50%)]  Loss: 2.138331
Train Epoch: 2 [40000/60000 (67%)]  Loss: 2.017443
Train Epoch: 2 [50000/60000 (83%)]  Loss: 1.934057
Train Epoch: 3 [0/60000 (0%)]  Loss: 1.731840
Train Epoch: 3 [10000/60000 (17%)]  Loss: 1.870500
Train Epoch: 3 [20000/60000 (33%)]  Loss: 1.674558
Train Epoch: 3 [30000/60000 (50%)]  Loss: 1.598748
Train Epoch: 3 [40000/60000 (67%)]  Loss: 1.360065
Train Epoch: 3 [50000/60000 (83%)]  Loss: 1.438313
Average test loss: 0.895073413848877
2024/12/11 22:45:12 INFO mlflow.tracking._tracking_service.client: 🏃 View run popular-koi-949 at: e2-dogfood.staging.cloud.databricks.com/ml/experiments/2280852430858633/runs/ff899ade65ad40fc8b17fd7e351015a9
2024/12/11 22:45:12 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: e2-dogfood.staging.cloud.databricks.com/ml/experiments/2280852430858633

Distributed setup

When you wrap the single-node code in the train() function, Databricks recommends including all the import statements inside the train() function to avoid library pickling issues.

Everything else is what is normally required to get distributed training working in PyTorch; a sketch of the resulting train() function follows this list.

  • Calling dist.init_process_group("nccl") at the beginning of train()
  • Calling dist.destroy_process_group() at the end of train()
  • Setting local_rank = int(os.environ["LOCAL_RANK"])
  • Adding a DistributedSampler to the DataLoader
  • Wrapping the model with DDP(model)
  • For more information, see https://pytorch.org/tutorials/intermediate/ddp_series_multinode.html
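
A minimal sketch of a train() function that follows these conventions; the hyperparameters, data path, and the Net/train_one_epoch helpers from the earlier sketch are assumptions for illustration:

def train():
    # Imports live inside the function to avoid pickling issues when the
    # function is shipped to the worker processes.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler
    from torchvision import datasets, transforms

    # torchrun / TorchDistributor sets the rank and world-size env vars.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
    )
    train_dataset = datasets.MNIST("/tmp/data", train=True, download=True, transform=transform)
    sampler = DistributedSampler(train_dataset)   # splits the data across ranks
    train_loader = DataLoader(train_dataset, batch_size=100, sampler=sampler)

    model = Net().to(local_rank)                  # Net as sketched earlier
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.Adadelta(ddp_model.parameters(), lr=0.001)

    for epoch in range(1, 4):
        sampler.set_epoch(epoch)                  # reshuffle differently each epoch
        train_one_epoch(ddp_model, local_rank, train_loader, optimizer, epoch)

    dist.destroy_process_group()
    return "finished"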

Data is located at: /dbfs/ml/pytorch/1733957479.3564515

Test without TorchDistributor

The following validates the distributed training loop by running it on a single GPU, without TorchDistributor.
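
One way to do this, shown here as an assumed sketch rather than the exact cell contents, is to set the environment variables that dist.init_process_group expects and then call train() directly in the driver process:

import os

# Single-process "distributed" run: world size 1, rank 0.
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12355"   # any free port
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

train()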

Running distributed training
/root/.ipykernel/4182/command-2280852430858390-193987937:30: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
  return F.log_softmax(x)
Train Epoch: 1 [0/60000 (0%)]  Loss: 2.284188
Train Epoch: 1 [10000/60000 (17%)]  Loss: 2.295907
Train Epoch: 1 [20000/60000 (33%)]  Loss: 2.273554
Train Epoch: 1 [30000/60000 (50%)]  Loss: 2.260596
Train Epoch: 1 [40000/60000 (67%)]  Loss: 2.248712
Train Epoch: 1 [50000/60000 (83%)]  Loss: 2.233172
Train Epoch: 2 [0/60000 (0%)]  Loss: 2.249660
Train Epoch: 2 [10000/60000 (17%)]  Loss: 2.155293
Train Epoch: 2 [20000/60000 (33%)]  Loss: 2.051280
Train Epoch: 2 [30000/60000 (50%)]  Loss: 1.962492
Train Epoch: 2 [40000/60000 (67%)]  Loss: 1.912481
Train Epoch: 2 [50000/60000 (83%)]  Loss: 1.886552
Train Epoch: 3 [0/60000 (0%)]  Loss: 1.862257
Train Epoch: 3 [10000/60000 (17%)]  Loss: 1.675992
Train Epoch: 3 [20000/60000 (33%)]  Loss: 1.436601
Train Epoch: 3 [30000/60000 (50%)]  Loss: 1.384962
Train Epoch: 3 [40000/60000 (67%)]  Loss: 1.489862
Train Epoch: 3 [50000/60000 (83%)]  Loss: 1.324469
Average test loss: 0.761795163154602
2024/12/11 22:52:14 INFO mlflow.tracking._tracking_service.client: 🏃 View run righteous-shrimp-206 at: e2-dogfood.staging.cloud.databricks.com/ml/experiments/2280852430858633/runs/434989bb483d4d48b1649f5050a5efee
2024/12/11 22:52:14 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: e2-dogfood.staging.cloud.databricks.com/ml/experiments/2280852430858633

Single-node multi-GPU training

PyTorch's native approach to single-node multi-GPU training is somewhat roundabout. Databricks provides a more streamlined solution that lets you move from single-node multi-GPU to multi-node training seamlessly. To do single-node multi-GPU training on Databricks, invoke the TorchDistributor API, set num_processes to the number of GPUs on the driver node that you want to use, and set local_mode=True, as sketched below.
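
A sketch of what that call might look like; NUM_GPUS_PER_NODE is an assumed variable holding the number of driver GPUs to use, and train is the function defined above:

from pyspark.ml.torch.distributor import TorchDistributor

NUM_GPUS_PER_NODE = 2  # assumed: GPUs available on the driver node

distributor = TorchDistributor(
    num_processes=NUM_GPUS_PER_NODE,  # one process per GPU on the driver
    local_mode=True,                  # run on the driver node only
    use_gpu=True,
)
distributor.run(train)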

Data is located at: /dbfs/ml/pytorch/1733957534.7987056
INFO:TorchDistributor:Started local training with 2 processes
INFO:TorchDistributor:Finished local training with 2 processes
Running distributed training
Running distributed training
Train Epoch: 1 [0/30000 (0%)]  Loss: 2.344287
Train Epoch: 1 [0/30000 (0%)]  Loss: 2.319191
Train Epoch: 1 [10000/30000 (33%)]  Loss: 2.286878
Train Epoch: 1 [10000/30000 (33%)]  Loss: 2.311805
Train Epoch: 1 [20000/30000 (67%)]  Loss: 2.266556
Train Epoch: 1 [20000/30000 (67%)]  Loss: 2.279220
Train Epoch: 2 [0/30000 (0%)]  Loss: 2.276896
Train Epoch: 2 [0/30000 (0%)]  Loss: 2.257281
Train Epoch: 2 [10000/30000 (33%)]  Loss: 2.268584
Train Epoch: 2 [10000/30000 (33%)]  Loss: 2.232745
Train Epoch: 2 [20000/30000 (67%)]  Loss: 2.242372
Train Epoch: 2 [20000/30000 (67%)]  Loss: 2.192771
Train Epoch: 3 [0/30000 (0%)]  Loss: 2.207398
Train Epoch: 3 [0/30000 (0%)]  Loss: 2.233767
Train Epoch: 3 [10000/30000 (33%)]  Loss: 2.163989
Train Epoch: 3 [10000/30000 (33%)]  Loss: 2.120047
Train Epoch: 3 [20000/30000 (67%)]  Loss: 2.066106
Train Epoch: 3 [20000/30000 (67%)]  Loss: 2.087892
Average test loss: 1.9049434661865234
2024/12/11 22:52:59 INFO mlflow.tracking._tracking_service.client: 🏃 View run amusing-shrike-121 at: https://e2-dogfood.staging.cloud.databricks.com/ml/experiments/2280852430858633/runs/78a38eed946942d89de4ac5ba935ab22
2024/12/11 22:52:59 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://e2-dogfood.staging.cloud.databricks.com/ml/experiments/2280852430858633

Multi-node training

To move from single-node multi-GPU training to multi-node training, change num_processes to the number of GPUs that you want to use across all worker nodes (this example uses all available GPUs, NUM_GPUS_PER_NODE * NUM_WORKERS) and change local_mode to False. Additionally, to configure how many GPUs to use for each Spark task that runs the train function, set spark.task.resource.gpu.amount <num_gpus_per_task> in the Spark config on the cluster page before creating the cluster. A sketch of the call follows.
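
A sketch under the same assumptions, with NUM_GPUS_PER_NODE and NUM_WORKERS as assumed variables describing the cluster:

from pyspark.ml.torch.distributor import TorchDistributor

NUM_GPUS_PER_NODE = 1  # assumed: GPUs per worker node
NUM_WORKERS = 2        # assumed: number of worker nodes

distributor = TorchDistributor(
    num_processes=NUM_GPUS_PER_NODE * NUM_WORKERS,  # total GPUs across the workers
    local_mode=False,                               # run on the worker nodes
    use_gpu=True,
)
distributor.run(train)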

Data is located at: /dbfs/ml/pytorch/1733957597.6578755
INFO:TorchDistributor:Started distributed training with 2 executor processes
Running distributed training
Running distributed training
Train Epoch: 1 [0/30000 (0%)]  Loss: 2.339231
Train Epoch: 1 [0/30000 (0%)]  Loss: 2.292328
Train Epoch: 1 [10000/30000 (33%)]  Loss: 2.310895
Train Epoch: 1 [10000/30000 (33%)]  Loss: 2.310132
Train Epoch: 1 [20000/30000 (67%)]  Loss: 2.292369
Train Epoch: 1 [20000/30000 (67%)]  Loss: 2.288747
Train Epoch: 2 [0/30000 (0%)]  Loss: 2.267363
Train Epoch: 2 [0/30000 (0%)]  Loss: 2.250873
Train Epoch: 2 [10000/30000 (33%)]  Loss: 2.252213
Train Epoch: 2 [10000/30000 (33%)]  Loss: 2.242889
Train Epoch: 2 [20000/30000 (67%)]  Loss: 2.257112
Train Epoch: 2 [20000/30000 (67%)]  Loss: 2.214770
Train Epoch: 3 [0/30000 (0%)]  Loss: 2.203631
Train Epoch: 3 [0/30000 (0%)]  Loss: 2.226314
Train Epoch: 3 [10000/30000 (33%)]  Loss: 2.151444
Train Epoch: 3 [10000/30000 (33%)]  Loss: 2.153582
Train Epoch: 3 [20000/30000 (67%)]  Loss: 2.086527
Train Epoch: 3 [20000/30000 (67%)]  Loss: 2.095566
Average test loss: 1.9135884046554565
2024/12/11 22:54:42 INFO mlflow.tracking._tracking_service.client: 🏃 View run clean-fawn-311 at: https://oregon.staging.cloud.databricks.com/ml/experiments/2280852430858633/runs/10ccc2655b19485d992d6ce93a76676c
2024/12/11 22:54:42 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://oregon.staging.cloud.databricks.com/ml/experiments/2280852430858633
INFO:TorchDistributor:Finished distributed training with 2 executor processes