Advanced usage of Databricks Connect
This article covers Databricks Connect for Databricks Runtime 14.0 and above.
This article describes topics that go beyond the basic setup of Databricks Connect.
Configure the Spark Connect connection string
In addition to connecting to your cluster using the options outlined in Configure a connection to a cluster, a more advanced option is connecting using the Spark Connect connection string. You can pass the string to the remote function or set the SPARK_REMOTE environment variable.
Only Databricks personal access token authentication is supported when connecting with the Spark Connect connection string.
- Python
- Scala
To set the connection string using the remote function:
from databricks.connect import DatabricksSession

# The retrieve_* helpers stand in for however your application loads these values.
workspace_instance_name = retrieve_workspace_instance_name()
token = retrieve_token()
cluster_id = retrieve_cluster_id()

spark = DatabricksSession.builder.remote(
    f"sc://{workspace_instance_name}:443/;token={token};x-databricks-cluster-id={cluster_id}"
).getOrCreate()
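Once the session is created, a quick way to confirm the connection works is to run a small query against the cluster, as in this minimal sketch:

# Runs a trivial job on the cluster; prints 10 if the connection works.
df = spark.range(10)
print(df.count())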
Alternatively, set the SPARK_REMOTE environment variable:
sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>
Then initialize the DatabricksSession class:
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
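To keep everything in one script, for example in a quick test, you can also set SPARK_REMOTE from Python itself before building the session. This sketch uses the same placeholders as above; substitute your own values:

import os
from databricks.connect import DatabricksSession

# Placeholder values; replace with your workspace instance, token, and cluster ID.
os.environ["SPARK_REMOTE"] = (
    "sc://<workspace-instance-name>:443/"
    ";token=<access-token-value>;x-databricks-cluster-id=<cluster-id>"
)
spark = DatabricksSession.builder.getOrCreate()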
Set the SPARK_REMOTE environment variable:
sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>
Then initialize the DatabricksSession class:
import com.databricks.connect.DatabricksSession
val spark = DatabricksSession.builder.getOrCreate()
Additional HTTP headers
Databricks Connect communicates with Databricks clusters via gRPC over HTTP/2.
For better control over the requests coming from clients, advanced users may choose to install a proxy service between the client and the Databricks cluster. In some cases, a proxy may require custom headers in the HTTP requests.
Use the header() method to add custom headers to HTTP requests:
- Python
- Scala
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.header('x-custom-header', 'value').getOrCreate()
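If your proxy expects more than one header, you can chain header() calls on the builder. The header names in this sketch are hypothetical:

from databricks.connect import DatabricksSession

spark = (
    DatabricksSession.builder
    .header('x-custom-header', 'value')
    .header('x-another-header', 'value2')  # hypothetical second header
    .getOrCreate()
)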
import com.databricks.connect.DatabricksSession
val spark = DatabricksSession.builder.header("x-custom-header", "value").getOrCreate()
Certificates
If your cluster relies on a custom SSL/TLS certificate to resolve a Databricks workspace fully qualified domain name (FQDN), you must set the GRPC_DEFAULT_SSL_ROOTS_FILE_PATH environment variable on your local development machine. This environment variable must be set to the full path of the certificate installed on the cluster.
- Python
- Scala
The following example sets this environment variable:
import os

# Example path to the certificate file; adjust for your environment.
os.environ["GRPC_DEFAULT_SSL_ROOTS_FILE_PATH"] = "/etc/ssl/certs/ca-bundle.crt"
For other ways to set environment variables, see your operating system's documentation.
Java and Scala do not offer ways to configure environment variables programmatically. Refer to your operating system or IDE documentation for information on how to configure them as part of your application.
Logging and debug logs
- Python
- Scala
Databricks Connect for Python produces logs using standard Python logging.
Logs are emitted to the standard error stream (stderr) and are turned off by default.
Setting the environment variable SPARK_CONNECT_LOG_LEVEL=debug overrides this default and prints all log messages at the DEBUG level and higher.
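For example, to enable debug logging for a single run without changing your shell profile, you can set the variable from Python; this sketch assumes the variable is set before the session is created so the new level takes effect:

import os

# Set before creating the session so the DEBUG level is picked up.
os.environ["SPARK_CONNECT_LOG_LEVEL"] = "debug"

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()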
Databricks Connect for Scala uses SLF4J logging and does not ship with any SLF4J providers.
Applications using Databricks Connect are expected to include an SLF4J provider and, in some cases, configure it to print the log messages.
- The simplest option is to include the slf4j-simple provider, which prints log messages at the INFO level and higher to the standard error stream (stderr).
- A more configurable alternative is to use the slf4j-reload4j provider, which picks up configuration from a log4j.properties file in the classpath.
The following example shows a simple log4j.properties file:
log4j.rootLogger=INFO,stderr
log4j.appender.stderr=org.apache.log4j.ConsoleAppender
log4j.appender.stderr.Target=System.err
log4j.appender.stderr.layout=org.apache.log4j.PatternLayout
log4j.appender.stderr.layout.ConversionPattern=%p\t%d{ISO8601}\t%r\t%c\t[%t]\t%m%n
In the preceding example, debug logs are printed if the root logger (or a specific logger) is configured at the DEBUG level:
log4j.rootLogger=DEBUG,stderr