Databases and Tables

Tables are a simple way to make structured data available across your organization. Tables are equivalent to Apache Spark DataFrames. This means that you can cache, filter, and perform any operations supported by DataFrames on tables. You can create a table using the Create Table UI or programmatically. You can query tables with Spark APIs and Spark SQL.
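
For example, a table can be loaded as a DataFrame and manipulated with ordinary DataFrame operations. A minimal Python sketch, assuming a table named diamonds (with a numeric price column) already exists in the selected database:

# Load the table as a DataFrame, cache it, and apply a filter
diamonds = spark.table("diamonds")
diamonds.cache()
expensive = diamonds.filter(diamonds.price > 1000)

# The same table can also be queried with Spark SQL
expensive_sql = spark.sql("SELECT * FROM diamonds WHERE price > 1000")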

Databricks uses the Hive metastore to manage tables, and supports all file formats and Hive data sources.

Prerequisite

To create and view databases and tables, you must select a cluster. By default, Databricks selects any running cluster to which you have access and selects the default database. You can change the cluster from the Databases menu, the Create Table UI, or the View Table UI.

View databases and tables

There are two types of tables in Databricks:

  • Global tables, which are accessible across all clusters.
  • Local tables, which are available only within one cluster.

Only global tables are visible in the Tables folder. See Create a table programmatically for information on table types.

  1. Click Tables Icon in the sidebar. The Databases and Tables folders display.

  2. Click the Dropdown Icon at the top of the Databases folder.

  3. Select a cluster. The Databases and Tables folders are refreshed.


Create a table

You can create a table from data sources such as uploaded files, S3, DBFS, and other Spark data sources.

Create a table using the UI

  1. Click Tables Icon in the sidebar. The Databases and Tables folders display.

  2. Click the Add Table Icon at the top of the Tables folder.

  3. Choose a data source and follow the steps to configure the table.

    Upload File
    1. Drag a file to the File dropzone or click the dropzone and choose a file. After the upload completes, a path displays. The path will be something like /FileStore/tables/<filename>-<random-number>.<file-type>. You can use this path in a notebook to read the data, as shown in the example after these steps.

    2. Click Create Table with UI.

    3. In the Cluster drop-down, choose a cluster.

    S3
    1. Click Create Table with UI.
    2. In the Cluster drop-down, choose a cluster.
    3. Enter a bucket name. The S3 bucket must be accessible from the cluster you selected.
    4. Click Browse Bucket.
    5. Select a file.
    DBFS
    1. Select a file.
    2. Click Create Table with UI.
    3. In the Cluster drop-down, choose a cluster.
  4. Click Preview Table to view the table.
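
For the Upload File source, you can read the uploaded file directly from its DBFS path in a notebook. A minimal Python sketch, using a hypothetical uploaded path; substitute the path shown in the UI after your upload:

# Hypothetical path reported by the upload UI; replace with your own
path = "/FileStore/tables/example-1a2b3.csv"

# Read the uploaded CSV into a DataFrame
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(path)
display(df)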

Create a table in a notebook

Databricks includes a number of notebooks that you can use as a quickstart to connect to a variety of data sources.

Create a table programmatically

This section describes how to create global and local tables programmatically.

Create a global table

Databricks registers global tables to the Hive metastore and makes them available across all clusters.

To create a global table in SQL:

CREATE TABLE tableName ...

See Create Table in the Spark SQL Language Manual for more options.

To create a global table from a DataFrame in Scala or Python:

dataFrame.write.saveAsTable("<table-name>")

Create a local table

You can create a local table on a cluster; the table is not accessible from other clusters and is not registered in the Hive metastore. This is also known as a temporary table or a view.

To create a local table from a DataFrame in Scala or Python in Spark 2.0 and above:

dataFrame.createOrReplaceTempView("<table-name>")

To create a local table from a DataFrame in Scala or Python below Spark 2.0:

dataFrame.registerTempTable("<table-name>")

Here is an example that creates a local table called diamonds from a file in the Databricks File System (DBFS):

# Read the CSV with a header row, infer the schema, and register a temporary view
dataPath = "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"
spark.read.format("csv").option("header", "true")\
  .option("inferSchema", "true").load(dataPath)\
  .createOrReplaceTempView("diamonds")

Access a table

You can view table details and read, update, and delete a table.

View table details

  1. Click Tables Icon in the sidebar.

  2. In the Tables folder, click the table name.

  3. In the Cluster drop-down, optionally select another cluster to render the table preview.


    Note

    To display the table preview, a Spark SQL query runs on the cluster selected in the Cluster drop-down. If the cluster is already running a workload, the table preview can take longer to load.

Read a table

These examples show you how to read a table called diamonds.

SQL
SELECT * FROM diamonds
Python
diamonds = spark.sql("SELECT * FROM diamonds")
display(diamonds.select("*"))
diamonds = spark.table("diamonds")
display(diamonds.select("*"))
Scala
val diamonds = spark.sql("SELECT * FROM diamonds")
display(diamonds.select("*"))
val diamonds = spark.table("diamonds")
display(diamonds.select("*"))
R
diamonds <- sql(sqlContext, "SELECT * FROM diamonds")
display(diamonds)
diamonds <- table(sqlContext, "diamonds")
display(diamonds)

Update a table

Table schema is immutable. However, you can update table data by changing the underlying files.

For example, for tables created from an S3 directory, adding or removing files in that directory changes the contents of the table.

After updating the files underlying a table, refresh the table using the following command:

%sql REFRESH TABLE <table-name>

This ensures that when you access the table, Spark SQL reads the correct files even if the underlying files change.
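
You can also trigger the same refresh from Python through the Spark catalog API; a minimal sketch, using the same table-name placeholder as above:

# Invalidate cached data and metadata for the table so the next read picks up the changed files
spark.catalog.refreshTable("<table-name>")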

Delete a table

Using the UI
  1. Click Tables Icon in the sidebar.
  2. Click the Menu Dropdown next to the table name and select Delete.
Programmatically
%sql DROP TABLE <table-name>

Managed and unmanaged tables

Every Spark SQL table has metadata information that stores the schema, as well as the data itself.

A managed table is a Spark SQL table for which Spark manages both the data and the metadata. In the case of a managed table, Databricks stores the metadata and data in DBFS in your account. Since Spark SQL manages the tables, running DROP TABLE example_data deletes both the metadata and the data.

Some common ways of creating a managed table are:

SQL
CREATE TABLE <example-table>(id STRING, value STRING)
DataFrame API
dataframe.write.saveAsTable('<example-table>')
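
To see the managed behavior concretely, here is a hedged Python sketch using the hypothetical table name example_data; DESCRIBE EXTENDED shows the DBFS location that Databricks manages, and DROP TABLE removes both the metadata and those files:

# Create a managed table from an existing DataFrame
dataframe.write.saveAsTable("example_data")

# The Location row in the output points at the DBFS path managed by Databricks
display(spark.sql("DESCRIBE EXTENDED example_data"))

# Dropping the managed table deletes the metadata and the underlying files
spark.sql("DROP TABLE example_data")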

Another option is to let Spark SQL manage the metadata and you control the data’s location. We refer to this as an unmanaged table. Spark SQL manages the relevant metadata, so when you perform DROP TABLE <example-table>, Spark removes only the metadata and not the data itself. The data is still present in the path you provided.

You can create an unmanaged table with your data in data sources such as Cassandra, JDBC table, and so on. See Spark Data Sources for more information on the data sources supported by Databricks. Some common ways of creating an unmanaged table are:

SQL
CREATE TABLE <example-table>(id STRING, value STRING) USING org.apache.spark.sql.parquet OPTIONS (path '<your-storage-path>')
DataFrame API
dataframe.write.option('path', '<your-storage-path>').saveAsTable('<example-table>')

Replace table contents

Simple way to replace table contents

The simplest way to replace table contents is to delete the table metadata and data and create another table.

To replace the contents of a managed table:

DROP TABLE IF EXISTS <example-table>     -- Deletes the metadata and data
CREATE TABLE <example-table> AS SELECT ...

To replace the contents of an unmanaged table:

DROP TABLE IF EXISTS <example-table>     -- Deletes the metadata
dbutils.fs.rm("<your-s3-path>", True)    # Deletes the data (Python)
CREATE TABLE <example-table> USING org.apache.spark.sql.parquet OPTIONS (path '<your-s3-path>') AS SELECT ...

Warning

Although the above approach is valid, because all the data is on S3, you might run into S3 eventual consistency issues if you delete the table and immediately recreate it in the same location. You can read more about consistency issues in the blog S3mper: Consistency in the Cloud.

Another alternative is to:

  1. Create the table using the SQL DDL:

    CREATE TABLE <table-name> (id LONG, date STRING) USING parquet LOCATION "<storage-location>"
    
  2. Store new data in <storage-location>.

  3. Run REFRESH TABLE <table-name>.
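
Putting these steps together in Python (a sketch using the same placeholders as the steps above; new_data stands for a DataFrame holding the replacement data):

# 1. Create the table definition over an external storage location
spark.sql("CREATE TABLE <table-name> (id LONG, date STRING) USING parquet LOCATION '<storage-location>'")

# 2. Write the new data files into that location
new_data.write.mode("overwrite").parquet("<storage-location>")

# 3. Refresh so that Spark SQL picks up the new files on the next read
spark.sql("REFRESH TABLE <table-name>")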

Best way to replace table contents

To avoid potential consistency issues, the best approach to replacing table contents is to overwrite the table.

DataFrame API (Scala)
import org.apache.spark.sql.SaveMode

dataframe.write.mode(SaveMode.Overwrite).saveAsTable("<example-table>")    // Managed overwrite
dataframe.write.mode(SaveMode.Overwrite).option("path", "<your-s3-path>").saveAsTable("<example-table>")  // Unmanaged overwrite
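
An equivalent sketch in Python, where the string save mode "overwrite" corresponds to SaveMode.Overwrite:

dataframe.write.mode("overwrite").saveAsTable("<example-table>")    # Managed overwrite
dataframe.write.mode("overwrite").option("path", "<your-s3-path>").saveAsTable("<example-table>")  # Unmanaged overwrite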

In SQL, you use the INSERT OVERWRITE statement to perform this operation. This applies to both managed and unmanaged tables.

For example, for an unmanaged table:

CREATE TABLE <example-table>(id STRING, value STRING) USING org.apache.spark.sql.parquet OPTIONS (path '<your-s3-path>')
INSERT OVERWRITE TABLE <example-table> SELECT ...

Partitioned tables

Spark SQL can dynamically generate partitions at the file storage level to provide partition columns for tables.

Create a partitioned table

These examples partition data that you write. Spark SQL discovers the partitions and registers them in the Hive metastore.

// Create managed table as select
dataframe.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable("<example-table>")
// Create unmanaged/external table as select
dataframe.write.mode(SaveMode.Overwrite).partitionBy("id").option("path", "<file-path>").saveAsTable("<example-table>")

However, if you create a partitioned table from existing data, Spark SQL does not automatically discover the partitions and register them in the Hive metastore. In this case, SELECT * FROM <example-table> will not return results. To register the partitions, run the following command: MSCK REPAIR TABLE <example-table>.

// Save data to external files
dataframe.write.mode(SaveMode.Overwrite).partitionBy("id").parquet("<file-path>")
// Create Unmanaged/External Table
spark.sql("CREATE TABLE <example-table>(id STRING, value STRING) USING parquet PARTITIONED BY(id) LOCATION '<file-path>'")
spark.sql("MSCK REPAIR TABLE '<example-table>'")

Partition pruning

When the table is scanned, Spark pushes down the filter predicates involving the partitionBy keys. In that case, Spark avoids reading data that doesn’t satisfy those predicates. For example, suppose you have a table <example-data> that is partitioned by <date>. A query such as SELECT max(id) FROM <example-data> WHERE date = '2010-10-10' reads only the data files containing tuples whose date value matches the one specified in the query.
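
To see pruning in action, here is a hedged Python sketch (hypothetical DataFrame, table, and column names); the physical plan of the filtered query shows the pushed-down predicate under PartitionFilters:

# Write a table partitioned by date (dataframe is assumed to have id and date columns)
dataframe.write.mode("overwrite").partitionBy("date").saveAsTable("events_by_date")

# Only files under the date=2010-10-10 partition directory are scanned for this query
spark.sql("SELECT max(id) FROM events_by_date WHERE date = '2010-10-10'").explain()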