Databases and Tables

A Databricks database is a collection of tables. A Databricks table is a collection of structured data. Tables are equivalent to Apache Spark DataFrames. This means that you can cache, filter, and perform any operations supported by DataFrames on tables. You can query tables with Spark APIs and Spark SQL.

There are two types of tables: global and local. A global table is available across all clusters. Databricks registers global tables to the Hive metastore. For details about Hive support, see Apache Hive Compatibility. A local table is not accessible from other clusters and is not registered in the Hive metastore. This is also known as a temporary table or a view.

You can create a table using the Create Table UI or programmatically. A table can be populated from files in DBFS or data stored in any of the supported data sources.

Requirements

To view and create databases and tables, you must have a running cluster.

View databases and tables

Click the Data icon in the sidebar. Databricks selects any running cluster to which you have access. The Databases folder displays the list of databases with the default database selected. The Tables folder displays the list of tables in the default database.

You can change the cluster from the Databases menu, create table UI, or view table UI. For example, from the Databases menu:

  1. Click the drop-down icon at the top of the Databases folder.

  2. Select a cluster.

Create a table

You can create a table using the UI or programmatically.

Create a table using the UI

Note

You cannot use this method to upload data if you are using a high concurrency cluster. Instead, use the Databricks File System (DBFS) to load your data into Databricks.

When you create a table using the UI, you create a global table.

  1. Click the Data icon in the sidebar. The Databases and Tables folders display.

  2. In the Database folder, select a database.

  3. Above the Tables folder, click Add Data.

  4. Choose a data source and follow the steps to configure the table.

    Upload File
    1. Drag files to the File dropzone or click the dropzone to browse for and select files. After upload, a path displays for each file. The path will be something like /FileStore/tables/<filename>-<random-number>.<file-type>, and you use this path in a notebook to read the data (see the sketch after these steps).

    2. Click Create Table with UI.

    3. In the Cluster drop-down, choose a cluster.

    S3
    1. Click Create Table with UI.
    2. In the Cluster drop-down, choose a cluster.
    3. Enter a bucket name. The S3 bucket must be accessible from the cluster you selected.
    4. Click Browse Bucket.
    5. Select a file.
    DBFS
    1. Select a file.
    2. Click Create Table with UI.
    3. In the Cluster drop-down, choose a cluster.
  5. Click Preview Table to view the table.

  6. In the Table Name field, optionally override the default table name.

  7. In the Create in Database field, optionally override the selected default database.

  8. In the File Type field, optionally override the inferred file type.

  9. If the file type is CSV:

    1. In the Column Delimiter field, optionally override the inferred delimiter.
    2. Indicate whether to use the first row as the column titles.
    3. Indicate whether to infer the schema.
  10. If the file type is JSON, indicate whether the file is multi-line.

  11. Click Create Table.
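
For reference, here is a minimal Python sketch for reading an uploaded file back from its /FileStore/tables/ path. The placeholder filename follows the pattern from the Upload File step, and the CSV options are assumptions; adjust them to match your file type.

# Read the uploaded file from DBFS; replace the placeholder path with the one
# displayed after upload, and adjust the format and options to match your file.
df = (spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/FileStore/tables/<filename>-<random-number>.<file-type>"))
display(df)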

Create a table in a notebook

In the Create New Table UI, you can use quickstart notebooks provided by Databricks to connect to any data source.

S3
Click Create Table in Notebook. The S3 bucket must be accessible from the cluster to which the notebook is attached.
DBFS
Click Create Table in Notebook.
Other Data Sources
In the Connector drop-down, select a data source type. Then click Create Table in Notebook.

Create a table programmatically

This section describes how to create global and local tables programmatically.

Create a global table

To create a global table in SQL:

create table <table-name> ...

See Create Table in the Spark SQL Language Manual for more options.

To create a global table from a DataFrame in Scala or Python:

dataFrame.write.saveAsTable("<table-name>")
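
For example, here is a minimal Python sketch that reads the diamonds sample CSV used later on this page and saves it as a global table; the table name diamonds_global is an arbitrary name chosen for illustration.

# Read a sample CSV from databricks-datasets and register it as a global table
# in the Hive metastore. mode("overwrite") replaces the table if it already exists.
diamonds_df = (spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"))

diamonds_df.write.mode("overwrite").saveAsTable("diamonds_global")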

Create a local table

To create a local table from a DataFrame in Scala or Python:

dataFrame.createOrReplaceTempView("<table-name>")

Here is an example that creates a local table called diamonds from a file in the Databricks File System (DBFS):

dataFrame = "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"
spark.read.format("csv").option("header","true")\
  .option("inferSchema", "true").load(dataFrame)\
  .createOrReplaceTempView("diamonds")
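
Once the temporary view exists, you can query it from the same cluster with Spark SQL or the DataFrame API. A minimal sketch:

# The temporary view is visible only to the cluster (SparkSession) that created it.
display(spark.sql("select * from diamonds limit 5"))

# Equivalent access through the DataFrame API.
display(spark.table("diamonds").limit(5))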

Access a table

You can view the details of a table, and read, update, and delete tables.

View table details

The table details view shows the table schema and sample data.

  1. Click the Data icon in the sidebar.

  2. In the Databases folder, click a database.

  3. In the Tables folder, click the table name.

  4. In the Cluster drop-down, optionally select another cluster to render the table preview.

    Note

    To display the table preview, a Spark SQL query runs on the cluster selected in the Cluster drop-down. If the cluster already has a workload running on it, the table preview may take longer to load.

Read a table

These examples show you how to read and display a table called diamonds.

SQL
select * from diamonds
Python
diamonds = spark.sql("select * from diamonds")
display(diamonds.select("*"))
diamonds = spark.table("diamonds")
display(diamonds.select("*"))
Scala
val diamonds = spark.sql("select * from diamonds")
display(diamonds.select("*"))
val diamonds = spark.table("diamonds")
display(diamonds.select("*"))
R
diamonds <- sql("select * from diamonds")
display(diamonds)
diamonds <- tableToDF("diamonds")
display(diamonds)

Update a table

The table schema is immutable. However, you can update table data by changing the underlying files.

For example, for tables created from an S3 directory, adding or removing files in that directory changes the contents of the table.

After updating the files underlying a table, refresh the table using the following command:

REFRESH TABLE <table-name>

This ensures that when you access the table, Spark SQL reads the correct files even if the underlying files change.
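
In a Python notebook, the same refresh can be issued through spark.sql or the catalog API; <table-name> is a placeholder.

# After adding or removing files under the table's storage path, invalidate
# Spark's cached metadata and file listing for the table.
spark.sql("REFRESH TABLE <table-name>")

# Equivalent call through the catalog API.
spark.catalog.refreshTable("<table-name>")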

Delete a table

Using the UI
  1. Click the Data icon in the sidebar.
  2. Click the drop-down menu next to the table name and select Delete.
Programmatically
DROP TABLE <table-name>

Managed and unmanaged tables

Every Spark SQL table has metadata information that stores the schema, as well as the data itself.

A managed table is a Spark SQL table for which Spark manages both the data and the metadata. For a managed table, Databricks stores the metadata and data in DBFS in your account. Because Spark SQL manages the table, DROP TABLE <example-table> deletes both the metadata and the data.

Some common ways of creating a managed table are:

SQL
create table <example-table>(id STRING, value STRING)
DataFrame API
dataframe.write.saveAsTable("<example-table>")

Another option is to let Spark SQL manage the metadata, while you control the data location. We refer to this as an unmanaged table. Spark SQL manages the relevant metadata, so when you perform DROP TABLE <example-table>, Spark removes only the metadata and not the data itself. The data is still present in the path you provided.

You can create an unmanaged table with your data in data sources such as Cassandra, a JDBC table, and so on. See Data Sources for more information about the data sources supported by Databricks. Some common ways of creating an unmanaged table are:

SQL
create table <example-table>(id STRING, value STRING) using org.apache.spark.sql.parquet options (path "<your-storage-path>")
DataFrame API
dataframe.write.option("path", "<your-storage-path>").saveAsTable("<example-table>")
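
To check whether an existing table is managed or unmanaged, one option is to inspect the extended table description; this sketch assumes the Spark 2.x DESCRIBE output, which returns col_name, data_type, and comment columns.

# The extended description includes a "Type" row (MANAGED or EXTERNAL)
# and a "Location" row showing where the data is stored.
details = spark.sql("DESCRIBE TABLE EXTENDED <example-table>")
display(details.filter("col_name in ('Type', 'Location')"))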

Replace table contents

There are two main approaches to replacing table contents: simple and recommended.

Simple way to replace table contents

The simplest way to replace table contents is to delete the table metadata and data and create another table.

Managed table
drop table if exists <example-table>    -- deletes the metadata and data
create table <example-table> as select ...
Unmanaged table
drop table if exists <example-table>    -- deletes the metadata
dbutils.fs.rm("<your-s3-path>", true)   // deletes the data
create table <example-table> using org.apache.spark.sql.parquet options (path "<your-s3-path>") as select ...

Warning

Although the above approach is valid, because all the data is on S3, you might run into S3 eventual consistency issues if you delete the table and immediately try to recreate it in the same location. You can read more about consistency issues in the blog post S3mper: Consistency in the Cloud.

The recommended alternative is to:

  1. Create the table using the SQL DDL:

    create table <table-name> (id long, date string) using parquet location "<storage-location>"
    
  2. Store new data in <storage-location>.

  3. Run refresh table <table-name>.
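
Put together, here is a hedged Python sketch of this flow; new_data stands in for a DataFrame holding the replacement data, and the table name and storage location are placeholders.

# 1. Create the table once, with an explicit storage location.
spark.sql("create table <table-name> (id long, date string) using parquet location '<storage-location>'")

# 2. Replace the files at that location with the new data.
new_data.write.mode("overwrite").parquet("<storage-location>")

# 3. Refresh so that subsequent reads of the table pick up the new files.
spark.sql("refresh table <table-name>")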

Partitioned tables

Spark SQL is able to generate partitions dynamically at the file storage level to provide partition columns for tables.

Create a partitioned table

These examples partition data that you write. Spark SQL discovers the partitions and registers them in the Hive metastore.

// Create managed table as select
dataframe.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable("<example-table>")

// Create unmanaged/external table as select
dataframe.write.mode(SaveMode.Overwrite).option("path", "<file-path>").saveAsTable("<example-table>")

However, if you create a partitioned table from existing data, Spark SQL does not automatically discover the partitions and register them in the Hive metastore. In this case, SELECT * FROM <example-table> does not return results. To register the partitions, run MSCK REPAIR TABLE <example-table>.

// Save data to external files
dataframe.write.mode(SaveMode.Overwrite).partitionBy("id").parquet("<file-path>")

// Create unmanaged/external table
spark.sql("CREATE TABLE <example-table>(id STRING, value STRING) USING parquet PARTITIONED BY(id) LOCATION "<file-path>"")
spark.sql("MSCK REPAIR TABLE "<example-table>"")

Partition pruning

When the table is scanned, Spark pushes down the filter predicates involving the partitionBy keys. In that case, Spark avoids reading data that doesn’t satisfy those predicates. For example, suppose you have a table <example-data> that is partitioned by date. A query such as SELECT max(id) FROM <example-data> WHERE date = '2010-10-10' reads only the data files containing tuples whose date value matches the one specified in the query.
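
Here is a hedged Python sketch of such a query; the table and column names are the placeholders from above, and explain() prints the physical plan so you can confirm which partition filters are pushed down.

# Only files in partitions whose date value matches the predicate are read.
pruned = spark.sql("select max(id) from <example-data> where date = '2010-10-10'")

# The file scan node in the physical plan lists the pushed-down partition filters.
pruned.explain()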

Table access control

Table access control allows admins and users to give fine-grained access to other users. See Table Access Control for details.