Query databases using JDBC

Databricks supports connecting to external databases using JDBC. This article provides the basic syntax for configuring and using these connections with examples in Python, SQL, and Scala.

Experimental

The legacy query federation documentation has been retired and might not be updated. The configurations mentioned in this content are not officially endorsed or tested by Databricks. If Lakehouse Federation supports your source database, Databricks recommends using that instead.

Partner Connect provides optimized integrations for syncing data with many external external data sources. See What is Databricks Partner Connect?.

important

The examples in this article do not include usernames and passwords in JDBC URLs. Databricks recommends using secrets to store your database credentials. For example:

Python
Scala

Python
username = dbutils.secrets.get(scope = "jdbc", key = "username")
password = dbutils.secrets.get(scope = "jdbc", key = "password")

Scala
val username = dbutils.secrets.get(scope = "jdbc", key = "username")
val password = dbutils.secrets.get(scope = "jdbc", key = "password")

To reference Databricks secrets with SQL, you must configure a Spark configuration property during cluster initilization.

For a full example of secret management, see Tutorial: Create and use a Databricks secret.

Establish cloud connectivity

Databricks VPCs are configured to allow only Spark clusters. When connecting to another infrastructure, the best practice is to use VPC peering. Once VPC peering is established, you can check with the netcat utility on the cluster.

Bash
%sh nc -vz <jdbcHostname> <jdbcPort>

Read data with JDBC

You must configure a number of settings to read data using JDBC. Note that each database uses a different format for the <jdbc-url>.

Python
SQL
Scala

Python
employees_table = (spark.read
  .format("jdbc")
  .option("url", "<jdbc-url>")
  .option("dbtable", "<table-name>")
  .option("user", "<username>")
  .option("password", "<password>")
  .load()
)

SQL
CREATE TEMPORARY VIEW employees_table_vw
USING JDBC
OPTIONS (
  url "<jdbc-url>",
  dbtable "<table-name>",
  user '<username>',
  password '<password>'
)

Scala
val employees_table = spark.read
  .format("jdbc")
  .option("url", "<jdbc-url>")
  .option("dbtable", "<table-name>")
  .option("user", "<username>")
  .option("password", "<password>")
  .load()

Spark automatically reads the schema from the database table and maps its types back to Spark SQL types.

Python
SQL
Scala

Python
employees_table.printSchema

SQL
DESCRIBE employees_table_vw

Scala
employees_table.printSchema

You can run queries against this JDBC table:

Python
SQL
Scala

Python
display(employees_table.select("age", "salary").groupBy("age").avg("salary"))

SQL
SELECT age, avg(salary) as salary
FROM employees_table_vw
GROUP BY age

Scala
display(employees_table.select("age", "salary").groupBy("age").avg("salary"))

Write data with JDBC

Saving data to tables with JDBC uses similar configurations to reading. See the following example:

Python
SQL
Scala

Python
(employees_table.write
  .format("jdbc")
  .option("url", "<jdbc-url>")
  .option("dbtable", "<new-table-name>")
  .option("user", "<username>")
  .option("password", "<password>")
  .save()
)

SQL
CREATE TABLE new_employees_table
  USING JDBC
OPTIONS (
  url "<jdbc-url>",
  dbtable "<table-name>",
  user '<username>',
  password '<password>'
) AS
SELECT * FROM employees_table_vw

Scala
employees_table.write
  .format("jdbc")
  .option("url", "<jdbc-url>")
  .option("dbtable", "<new-table-name>")
  .option("user", "<username>")
  .option("password", "<password>")
  .save()

The default behavior attempts to create a new table and throws an error if a table with that name already exists.

You can append data to an existing table using the following syntax:

Python
SQL
Scala

Python
(employees_table.write
  .format("jdbc")
  .option("url", "<jdbc-url>")
  .option("dbtable", "<new-table-name>")
  .option("user", "<username>")
  .option("password", "<password>")
  .mode("append")
  .save()
)

SQL
CREATE TABLE IF NOT EXISTS new_employees_table
  USING JDBC
OPTIONS (
  url "<jdbc-url>",
  dbtable "<table-name>",
  user '<username>',
  password '<password>'
);

INSERT INTO new_employees_table
SELECT * FROM employees_table_vw;

Scala
employees_table.write
  .format("jdbc")
  .option("url", "<jdbc-url>")
  .option("dbtable", "<new-table-name>")
  .option("user", "<username>")
  .option("password", "<password>")
  .mode("append")
  .save()

You can overwrite an existing table using the following syntax:

Python
SQL
Scala

Python
(employees_table.write
  .format("jdbc")
  .option("url", "<jdbc-url>")
  .option("dbtable", "<new-table-name>")
  .option("user", "<username>")
  .option("password", "<password>")
  .mode("overwrite")
  .save()
)

SQL
CREATE OR REPLACE TABLE new_employees_table
  USING JDBC
OPTIONS (
  url "<jdbc-url>",
  dbtable "<table-name>",
  user '<username>',
  password '<password>'
) AS
SELECT * FROM employees_table_vw;

Scala
employees_table.write
  .format("jdbc")
  .option("url", "<jdbc-url>")
  .option("dbtable", "<new-table-name>")
  .option("user", "<username>")
  .option("password", "<password>")
  .mode("overwrite")
  .save()

Control parallelism for JDBC queries

By default, the JDBC driver queries the source database with only a single thread. To improve performance for reads, you need to specify a number of options to control how many simultaneous queries Databricks makes to your database. For small clusters, setting the numPartitions option equal to the number of executor cores in your cluster ensures that all nodes query data in parallel.

warning

Setting numPartitions to a high value on a large cluster can result in negative performance for the remote database, as too many simultaneous queries might overwhelm the service. This is especially troublesome for application databases. Be wary of setting this value above 50.

note

Speed up queries by selecting a column with an index calculated in the source database for the partitionColumn.

The following code example demonstrates configuring parallelism for a cluster with eight cores:

Python
SQL
Scala

Python
employees_table = (spark.read
  .format("jdbc")
  .option("url", "<jdbc-url>")
  .option("dbtable", "<table-name>")
  .option("user", "<username>")
  .option("password", "<password>")
  # a column that can be used that has a uniformly distributed range of values that can be used for parallelization
  .option("partitionColumn", "<partition-key>")
  # lowest value to pull data for with the partitionColumn
  .option("lowerBound", "<min-value>")
  # max value to pull data for with the partitionColumn
  .option("upperBound", "<max-value>")
  # number of partitions to distribute the data into. Do not set this very large (~hundreds)
  .option("numPartitions", 8)
  .load()
)

SQL
CREATE TEMPORARY VIEW employees_table_vw
USING JDBC
OPTIONS (
  url "<jdbc-url>",
  dbtable "<table-name>",
  user '<username>',
  password '<password>',
  partitionColumn "<partition-key>",
  lowerBound "<min-value>",
  upperBound "<max-value>",
  numPartitions 8
)

Scala
val employees_table = spark.read
  .format("jdbc")
  .option("url", "<jdbc-url>")
  .option("dbtable", "<table-name>")
  .option("user", "<username>")
  .option("password", "<password>")
  // a column that can be used that has a uniformly distributed range of values that can be used for parallelization
  .option("partitionColumn", "<partition-key>")
  // lowest value to pull data for with the partitionColumn
  .option("lowerBound", "<min-value>")
  // max value to pull data for with the partitionColumn
  .option("upperBound", "<max-value>")
  // number of partitions to distribute the data into. Do not set this very large (~hundreds)
  .option("numPartitions", 8)
  .load()

note

Databricks supports all Apache Spark options for configuring JDBC.

When writing to databases using JDBC, Apache Spark uses the number of partitions in memory to control parallelism. You can repartition data before writing to control parallelism. Avoid high number of partitions on large clusters to avoid overwhelming your remote database. The following example demonstrates repartitioning to eight partitions before writing:

Python
SQL
Scala

Python
(employees_table.repartition(8)
  .write
  .format("jdbc")
  .option("url", "<jdbc-url>")
  .option("dbtable", "<new-table-name>")
  .option("user", "<username>")
  .option("password", "<password>")
  .save()
)

SQL
CREATE TABLE new_employees_table
  USING JDBC
OPTIONS (
  url "<jdbc-url>",
  dbtable "<table-name>",
  user '<username>',
  password '<password>'
) AS
SELECT /*+ REPARTITION(8) */ * FROM employees_table_vw

Scala
employees_table.repartition(8)
  .write
  .format("jdbc")
  .option("url", "<jdbc-url>")
  .option("dbtable", "<new-table-name>")
  .option("user", "<username>")
  .option("password", "<password>")
  .save()

Push down a query to the database engine

You can push down an entire query to the database and return just the result. The table parameter identifies the JDBC table to read. You can use anything that is valid in a SQL query FROM clause.

Python
SQL
Scala

Python
pushdown_query = "(select * from employees where emp_no < 10008) as emp_alias"

employees_table = (spark.read
  .format("jdbc")
  .option("url", "<jdbc-url>")
  .option("dbtable", pushdown_query)
  .option("user", "<username>")
  .option("password", "<password>")
  .load()
)

SQL
CREATE TEMPORARY VIEW employees_table_vw
USING JDBC
OPTIONS (
  url "<jdbc-url>",
  dbtable "(select * from employees where emp_no < 10008) as emp_alias",
  user '<username>',
  password '<password>'
)

Scala
val pushdown_query = "(select * from employees where emp_no < 10008) as emp_alias"

val employees_table = spark.read
  .format("jdbc")
  .option("url", "<jdbc-url>")
  .option("dbtable", pushdown_query)
  .option("user", "<username>")
  .option("password", "<password>")
  .load()

Control number of rows fetched per query

JDBC drivers have a fetchSize parameter that controls the number of rows fetched at a time from the remote database.

Setting	Result
Too low	High latency due to many roundtrips (few rows returned per query)
Too high	Out of memory error (too much data returned in one query)

The optimal value is workload dependent. Considerations include:

How many columns are returned by the query?
What data types are returned?
How long are the strings in each column returned?

Systems might have very small default and benefit from tuning. For example: Oracle's default fetchSize is 10. Increasing it to 100 reduces the number of total queries that need to be executed by a factor of 10. JDBC results are network traffic, so avoid very large numbers, but optimal values might be in the thousands for many datasets.

Use the fetchSize option, as in the following example:

Python
SQL
Scala

Python
employees_table = (spark.read
  .format("jdbc")
  .option("url", "<jdbc-url>")
  .option("dbtable", "<table-name>")
  .option("user", "<username>")
  .option("password", "<password>")
  .option("fetchSize", "100")
  .load()
)

SQL
CREATE TEMPORARY VIEW employees_table_vw
USING JDBC
OPTIONS (
  url "<jdbc-url>",
  dbtable "<table-name>",
  user '<username>',
  password '<password>'.
  fetchSize 100
)

Scala
val employees_table = spark.read
  .format("jdbc")
  .option("url", "<jdbc-url>")
  .option("dbtable", "<table-name>")
  .option("user", "<username>")
  .option("password", "<password>")
  .option("fetchSize", "100")
  .load()

Establish cloud connectivity​

Read data with JDBC​

Write data with JDBC​

Control parallelism for JDBC queries​

Push down a query to the database engine​

Control number of rows fetched per query​