Databases and Tables

Tables are a simple way to make structured data available across your organization. A user can query tables with Spark SQL or any of Apache Spark’s language APIs. A user defines a table using the Create Table UI (Creating Tables) or programmatically (Create Tables Programmatically).

Databricks uses the Hive metastore to manage tables, and supports all file formats and Hive data sources. A table is equivalent to an Apache Spark DataFrame: you can cache it, filter it, and perform any other relevant Spark operations on it.


Use the Spark SQL Language Manual as a reference for SQL commands.

Choosing a cluster

In Databricks, creating and viewing databases and tables requires you to select a cluster. By default, Databricks selects any cluster to which you currently have access. You can change the cluster at any time from the Data sidebar, the Create Table UI, or the View Table UI.

To change the cluster used from the Data sidebar, click Tables Icon, then open the cluster selector by clicking Data Cluster Dropdown Icon at the top of the Databases panel. Select a cluster from the list displayed. After you have selected a cluster, the database and table lists will be refreshed using the selected cluster.

Listing Databases and Tables

View available tables by clicking Tables Icon from the main menu. Note that you must select a cluster to view tables.

Create Tables List


There are two kinds of tables in Databricks:

  • “Global” tables, which are accessible across all clusters.
  • “Local” tables, which are available only within one cluster.

Only global tables are visible in the Tables panel. See Create Tables Programmatically below for more details.

Creating Tables

Tables support a variety of Apache Spark data sources.

To create a table using the UI, click Tables Icon to open the tables list and click the Add Table Icon at the top of the Tables panel. Then follow the steps for your desired data source.


Databricks includes a number of individual notebooks that you can use as a quickstart guide to connect to a variety of data sources. Click “Spark Data Sources” to see them.

Create Tables Programmatically

Global Tables:

Databricks registers global tables to the Hive metastore and makes them available across all clusters.

To create a global table in SQL:

CREATE TABLE tableName ...

See Create Table in the Spark SQL Language Manual for more options.

To create a global table from a DataFrame in Scala or Python:

dataframe.write.saveAsTable("tableName")

Local Tables: You can create a local table on a cluster; the table is not accessible from other clusters and is not registered in the Hive metastore. Local tables are also known as temporary tables.

To create a local table from a DataFrame in Scala or Python in Spark 2.0:

dataframe.createOrReplaceTempView("tableName")

In Spark versions older than 2.0:

dataframe.registerTempTable("tableName")

Here is an example that creates a local table called “diamonds” from a file in the Databricks File System (DBFS):

dataPath = "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"
spark.read.format("csv")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load(dataPath)\
  .createOrReplaceTempView("diamonds")

Accessing Tables

Table Details

To view a table’s details, click on the table name in the list to reveal a page about that table, including its schema and sample data.

Table Details


To display this view, a Spark SQL query runs on the “default” Spark cluster in your account (you can specify the default cluster on the cluster list page).

If your Spark cluster already has a workload running on it, that can cause this view to take longer to load.

Reading From Tables

These examples show you how to access a table called “diamonds” from a notebook.

From SQL:

SELECT * FROM diamonds

From Python, use one of these examples:

diamonds = spark.sql("SELECT * FROM diamonds")
diamonds = spark.table("diamonds")

From Scala, use one of these examples:

val diamonds = spark.sql("SELECT * FROM diamonds")
val diamonds = spark.table("diamonds")

From R, use one of these examples:

diamonds <- sql(sqlContext, "SELECT * FROM diamonds")
diamonds <- table(sqlContext, "diamonds")

The diamonds variable in each of these examples is a Spark DataFrame, so you can cache it, filter it, and perform any other relevant Spark operations on it. See DataFrames and Datasets for more details.

Updating Tables

Table schema is immutable. However, a user can update table data by changing the underlying files.

For example, for tables created from an S3 directory, adding files to or removing files from that directory changes the contents of the table.

After updating the files underlying a table, be sure to refresh the table using the following command:

%sql REFRESH TABLE <table_name>

This ensures that when a user accesses the table, Spark SQL reads the correct files even if the underlying files have changed.

Deleting Tables

A user can delete tables either through the UI (by selecting the menu dropdown next to the table name) or programmatically via:

%sql DROP TABLE <table_name>

Managed and Unmanaged Tables

Every Spark SQL table has two components: metadata, which stores the schema and other information about the table, and the data itself.

Managed Tables

Managed tables are Spark SQL tables where Spark manages both the data and the metadata. Some common ways of creating a managed table are:

SQL: CREATE TABLE example_data(id STRING, value STRING)

DataFrame API: dataframe.write.saveAsTable('example_data')

In the case of a managed table, Databricks stores both the metadata and the data in DBFS (Databricks File System) in your account.

Since Spark SQL manages the tables, doing a DROP TABLE example_data will delete both the metadata and data automatically.

Unmanaged Tables

Another option is to let Spark SQL manage the metadata while you control the data’s location; this is called an unmanaged table. For example, with data in S3:

SQL: CREATE TABLE example_data(id STRING, value STRING) USING org.apache.spark.sql.parquet OPTIONS (path 'YOUR_S3_PATH')

DataFrame API: dataframe.write.option('path', 'YOUR_S3_PATH').saveAsTable('example_data')

Spark SQL manages just the relevant metadata, so when you perform DROP TABLE example_data, Spark removes only the metadata, not the data itself. The data is still present at the S3 path you provided.

Note that the data need not be in S3. You can also create an unmanaged table with your data in other data sources such as Cassandra, Redshift, or a JDBC table. See Spark Data Sources for more information on the variety of data sources.

Replacing the Table Contents

The simplest approach is to delete the table metadata and data and create another one. Here’s how you can replace the contents of a managed table:

DROP TABLE IF EXISTS example_data    -- Deletes the metadata and data

CREATE TABLE example_data AS SELECT ...

To replace the contents of an unmanaged table:

DROP TABLE IF EXISTS example_data    -- Deletes the metadata

dbutils.fs.rm("YOUR_S3_PATH", True)   # Deletes the data

CREATE TABLE example_data USING org.apache.spark.sql.parquet OPTIONS (path 'YOUR_S3_PATH') AS SELECT ...


Though the above approach is valid, because all the data is on S3 you may run into eventual consistency issues if you delete a piece of data and immediately try to recreate it in the same location. You can read more about these consistency issues in this blog.

Another alternative is to create the table using SQL DDL:

CREATE TABLE example_data (
  id LONG,
  ...
) USING parquet

Then simply append new data files to that storage location and run REFRESH TABLE <table_name>, and you’ll have the new version of the table.

Best Practice to Overwrite Tables

The best approach to avoid these consistency issues is to overwrite the table.

DataFrame API (Scala):

import org.apache.spark.sql.SaveMode

dataframe.write.mode(SaveMode.Overwrite).saveAsTable("example_data")    // Managed overwrite

dataframe.write.mode(SaveMode.Overwrite).option("path", "YOUR_S3_PATH").saveAsTable("example_data")  // Unmanaged overwrite

In SQL, you use the INSERT OVERWRITE statement to do this. It applies to both managed and unmanaged tables.

For example for an unmanaged table:

CREATE TABLE example_table(id STRING, value STRING) USING org.apache.spark.sql.parquet OPTIONS (path 'YOUR_S3_PATH')

INSERT OVERWRITE TABLE example_table SELECT ...

Create Partitioned Tables

Spark SQL can dynamically generate partitions at the file storage level to provide partition columns for tables. When you partition data as you write it out, Spark SQL discovers the partitions and registers them in the Hive metastore.

// Create managed table as select
dataframe.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable("example_data")

// Create unmanaged/external table as select
dataframe.write.mode(SaveMode.Overwrite).option("path", "FILE_PATH").partitionBy("id").saveAsTable("example_data")

However, if you create a partitioned table from existing data, Spark SQL does not automatically discover the partitions and register them in the Hive metastore. In this case, SELECT * FROM example_table returns an empty result. To register the partitions, run the following command: MSCK REPAIR TABLE example_table

// Save data to external files
dataframe.write.mode(SaveMode.Overwrite).partitionBy("id").parquet("FILE_PATH")

// Create unmanaged/external table
spark.sql("CREATE TABLE example_table(id STRING, value STRING) USING parquet PARTITIONED BY (id) LOCATION 'FILE_PATH'")
spark.sql("MSCK REPAIR TABLE example_table")

Partition Pruning

When the table is scanned, Spark pushes down filter predicates that involve the partitionBy keys, so Spark avoids reading data that cannot satisfy those predicates.

For example, suppose you have a table that is partitioned by date. A query such as SELECT max(id) FROM table WHERE date = '2010-10-10' will only read data files that are known to contain only tuples whose date value is the one specified in the query.