Databases and Tables

Tables are a simple way to make structured data available across your organization. A user can query tables with Spark SQL or any of Apache Spark’s language APIs. A user defines a table using the Create Table UI (Creating Tables) or programmatically (Create Tables Programmatically).

Databricks uses the Hive metastore to manage tables, and supports all file formats and Hive data sources. Tables are equivalent to Apache Spark DataFrames: you can cache them, filter them, and perform any other relevant Spark operations on them.
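
For example, here is a minimal Python sketch that treats a table as a DataFrame. The table name “diamonds” and the price filter are illustrative assumptions; the table itself is created in the examples later in this topic.

# Load an existing table as a DataFrame, then filter and cache it like any other DataFrame.
diamonds = sqlContext.table("diamonds")           # "diamonds" is an assumed, pre-existing table
expensive = diamonds.filter(diamonds["price"] > 1000).cache()
display(expensive)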

Tip

Use the Spark SQL Language Manual as a reference for SQL commands.

Choosing a cluster

In Databricks, creating and viewing databases and tables requires you to select a cluster. By default, Databricks chooses any cluster to which you currently have access. You can change the cluster at any time from the Data sidebar, the Create Table UI, or the View Table UI.

To change the cluster used from the Data sidebar, click Tables Icon, then open the cluster selector by clicking Data Cluster Dropdown Icon at the top of the Databases panel. Select a cluster from the list displayed. After you have selected a cluster, the database and table lists will be refreshed using the selected cluster.

Listing Databases and Tables

View available tables by clicking Tables Icon from the main menu. Note that you must have a cluster selected to view tables.

Create Tables List

Note

There are two kinds of tables in Databricks:

  • “Global” tables are accessible across all clusters.
  • “Local” tables are available only within one cluster.

Only global tables are visible in the Tables panel. See Create Tables Programmatically below for more details.

Creating Tables

Tables support a variety of Apache Spark data sources, as described in the sections below.

To create a table using the UI, click Tables Icon to open the tables list, then click the Add Table Icon at the top of the Tables panel and follow the steps below for your data source.

Create from Amazon S3

You can create a table directly from data you have on S3.

Note

Databricks will not read data from a table created from S3 until a user accesses it. If you change the data in the S3 directory, the table will reflect the new data. See Updating Tables for details.

  1. Select S3 as the data source and fill in your S3 information, including the bucket name.
Table Create from S3
  2. A file browser appears that allows you to select the directory containing the CSV files to import. Then click Preview.
S3 File Browser
  3. Finally, preview the table and select a name, the file format, the delimiter, and the column names and types. Click “Create Table” to finish creating the table.
S3 Table Details
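
If you prefer to define this kind of table programmatically rather than through the UI, a minimal sketch from a Python notebook might look like the following. The table name, bucket path, and CSV options are hypothetical placeholders, not values produced by the UI.

# Hypothetical example: register a table over CSV files stored in S3.
# Replace the table name and the S3 path with your own values.
sqlContext.sql("""
  CREATE TABLE my_s3_table
  USING com.databricks.spark.csv
  OPTIONS (path "s3a://my-bucket/path/to/data/", header "true", inferSchema "true")
""")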

Create from local file(s)

  1. Select File as the data source, then drag and drop your CSV, JSON, Parquet, or Avro files.
Table Create from File
  2. Fill in the table name, check the file format, set the column delimiter, and name the table columns. Click “Create Table” to finish creating the table.

Tip

If you want to process the files in Apache Spark without creating a table, follow the guide in Importing Data.

Create from DBFS

If you already have data in the Databricks File System (DBFS), you can create a table from it.

Databricks will not read data from a table created from DBFS until a user accesses it. If you change the data in the DBFS directory, the table will reflect the new data. See Updating Tables for details.

  1. Select DBFS as the data source and select the directory that contains the files you want to create the table from. You must select a directory, not a file.
Table Create from DBFS
  2. Preview the table, select a name, the file format, delimiter, and the column names and types. Click “Create Table” to finish creating the table.

Create from JDBC databases

Tables can be created from JDBC databases (for example, MySQL or PostgreSQL).

  1. Select JDBC as the data source and fill in the JDBC URL of your database, your username and password, and the table name or SQL query to create the table from.

Note that you can use anything that is valid in the FROM clause of a SQL query in the “Table / Query” field.

For example, instead of a full table you could also use a subquery in parentheses.

Table Create from JDBC
  2. Click “Connect” and name the table. Click “Create Table” to finish creating the table.
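
As a programmatic alternative to the JDBC UI, you can read the same kind of source directly with the DataFrame reader and optionally register the result as a table. The sketch below uses a hypothetical MySQL URL, credentials, and query; substitute your own connection details.

# Hypothetical JDBC connection details; replace them with your own database information.
jdbc_df = (sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://my-host:3306/my_database")
  .option("dbtable", "(SELECT id, value FROM my_table) AS t")  # a subquery in parentheses also works here
  .option("user", "my_user")
  .option("password", "my_password")
  .load())
jdbc_df.write.saveAsTable("my_jdbc_table")  # optionally register the result as a global table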

Create Tables Programmatically

Global Tables:

Databricks registers global tables to the Hive metastore and makes them available across all clusters.

To create a global table in SQL:

CREATE TABLE tableName ...

See Create Table in the Spark SQL Language Manual for more options.

To create a global table from a DataFrame in Scala or Python:

dataFrame.write.saveAsTable("tableName")
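
For example, the following Python sketch reads the diamonds CSV file used later in this topic and registers it as a global table. The table name “diamonds” is just an illustration.

# Read a CSV file from DBFS and save it as a global table registered in the Hive metastore.
dataPath = "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"
sqlContext.read.format("com.databricks.spark.csv") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .load(dataPath) \
  .write.saveAsTable("diamonds")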

Local Tables: You can create a local table on a cluster; it is not accessible from other clusters and is not registered in the Hive metastore. Local tables are also known as temporary tables.

To create a local table from a DataFrame in Scala or Python in Spark 2.0:

dataFrame.createOrReplaceTempView("diamonds")

In Spark versions older than 2.0:

dataFrame.registerTempTable("diamonds")

Here is an example that creates a local table called “diamonds” from a file in the Databricks File System (DBFS):

dataPath = "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"
sqlContext.read.format("com.databricks.spark.csv")\
  .option("header","true")\
  .option("inferSchema", "true")\
  .load(dataPath)\
  .createOrReplaceTempView("diamonds")
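
Once the temporary view exists, you can query it from the same notebook. For instance, an illustrative aggregation over the diamonds schema loaded above:

# Query the local "diamonds" table created above.
display(sqlContext.sql("SELECT color, avg(price) AS avg_price FROM diamonds GROUP BY color"))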

Accessing Tables

Table Details

To view a table’s details, click on the table name in the list to reveal a page about that table, including its schema and sample data.

Table Details

Note

To display this view, a Spark SQL query runs on the “default” Spark cluster in your account (you can specify the default cluster on the cluster list page).

If your Spark cluster already has a workload running on it, that can cause this view to take longer to load.

Reading From Tables

These examples show you how to access a table called “diamonds” from a notebook.

From SQL:

SELECT * FROM diamonds

From Python, use one of these examples:

diamonds = sqlContext.sql("SELECT * FROM diamonds")
display(diamonds.select("*"))
diamonds = sqlContext.table("diamonds")
display(diamonds.select("*"))

From Scala, use one of these examples:

val diamonds = sqlContext.sql("SELECT * FROM diamonds")
display(diamonds.select("*"))
val diamonds = sqlContext.table("diamonds")
display(diamonds.select("*"))

From R, use one of these examples:

diamonds <- sql(sqlContext, "SELECT * FROM diamonds")
display(diamonds)
diamonds <- table(sqlContext, "diamonds")
display(diamonds)

The variable diamonds in these examples is a Spark DataFrame, so you can cache it, filter it, and perform any other relevant Spark operations on it. See DataFrames and Datasets for more details.

Updating Tables

A table’s schema is immutable. However, a user can update table data by changing the underlying files.

For example, for tables created from an S3 directory, adding files to or removing files from that directory changes the contents of the table.

After updating the files underlying a table, be sure to refresh the table using the following command:

%sql REFRESH TABLE <table_name>

This ensures that when a user accesses the table, Spark SQL reads the correct files even if the underlying files have changed.

Deleting Tables

A user can delete tables either through the UI (by selecting the menu dropdown next to the table name) or programmatically via:

%sql DROP TABLE <table_name>

Managed and Unmanaged Tables

Every Spark SQL table has metadata information that stores its schema, along with the data itself.

Managed Tables

Managed tables are Spark SQL tables where Spark manages both the data and the metadata. Some common ways of creating a managed table are:

SQL: create table example_table(id string, value string)

DataFrame API: dataframe.write.saveAsTable('example_table')

In the case of a managed table, Databricks stores both the metadata and the data in DBFS (Databricks File System) in your account.

Because Spark SQL manages the table, drop table example_table deletes both the metadata and the data automatically.
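
Put together, a minimal Python sketch of the managed table lifecycle looks like this. The DataFrame contents and the table name are illustrative.

# A small illustrative DataFrame matching the (id string, value string) schema above.
dataframe = sqlContext.createDataFrame([("1", "a"), ("2", "b")], ["id", "value"])

# Create a managed table: Spark SQL stores both the data and the metadata in DBFS.
dataframe.write.saveAsTable("example_table")

# Dropping a managed table deletes the metadata and the data.
sqlContext.sql("DROP TABLE example_table")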

Unmanaged Tables

Another option is to let Spark SQL manage the metadata while you control the data location. We refer to this as an unmanaged table. For example, with data stored in S3:

SQL: create table example_data(id string, value string) using org.apache.spark.sql.parquet options (path 'YOUR_S3_PATH')

DataFrame API: dataframe.write.option('path', 'YOUR_S3_PATH').saveAsTable('example_data')

Spark SQL will just manage the relevant metadata, so when you perform drop table example_data, Spark will only remove the metadata and not the data itself. The data will still be present in the S3 path you provided.

Note that the data need not be in S3. You can also create an unmanaged table with your data in other data sources like Cassandra, Redshift, JDBC table, etc. See Spark Data Sources for more information on a variety of data sources.

Replacing the contents of a table

The simplest approach is to delete the table’s metadata and data and then create a new table. Here’s how you can replace the contents of a managed table:

drop table if exists example_data    -- Deletes the metadata and the data

create table example_data as select ...

To replace the contents of an unmanaged table:

drop table if exists example_data    -- Deletes the metadata

dbutils.fs.rm("YOUR_S3_PATH", True)   # Deletes the data

create table example_data using org.apache.spark.sql.parquet options (path 'YOUR_S3_PATH') as select ...

S3 Eventual Consistency

Though the above approach is valid, because all of the data is on S3 you may run into eventual consistency issues if you delete data and immediately try to recreate it in the same location. You can read more about the consistency issues in this blog.

Best Practice to Overwrite Tables

The best approach to avoid this is to overwrite the table.

DataFrame API (Scala):

import org.apache.spark.sql.SaveMode

dataframe.write.mode(SaveMode.Overwrite).saveAsTable("example_data")    // Managed table overwrite

dataframe.write.mode(SaveMode.Overwrite).option("path", "YOUR_S3_PATH").saveAsTable("example_data")  // Unmanaged table overwrite

In SQL, you use the insert overwrite keyword to perform this operation. This applies to both managed and unmanaged tables.

For example, for an unmanaged table:

create table example_table(id string, value string) using org.apache.spark.sql.parquet options (path 'YOUR_S3_PATH')

insert overwrite table example_table select ...

Hive Partitions vs Spark SQL Partitions

Spark SQL can dynamically generate partitions at the file storage level to provide partition columns for tables. The following call partitions the data as you write it out.

dataframe.write.partitionBy('colName').saveAsTable('example_table')

However, Spark SQL does not register these partitions in the Hive metastore, so show partitions example_table returns empty results. To register the partitions, run the following command to generate them: MSCK REPAIR TABLE example_table
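
As an end-to-end illustration, the following Python sketch writes a partitioned table and then registers its partitions with the metastore. The table name, column names, and sample rows are hypothetical.

# Write a table partitioned by the "date" column (hypothetical sample data).
df = sqlContext.createDataFrame(
    [("2010-10-10", 1), ("2010-10-11", 2)], ["date", "id"])
df.write.partitionBy("date").saveAsTable("example_table")

# Register the partitions with the Hive metastore, then list them.
sqlContext.sql("MSCK REPAIR TABLE example_table")
display(sqlContext.sql("SHOW PARTITIONS example_table"))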

Partition Pruning

Hive-style partitions are useful when the table is scanned and there are pushed-down filter predicates involving the partitionBy keys. In that case, Spark avoids reading data that does not satisfy those predicates.

For example, suppose you have a table that is partitioned by date. A query such as SELECT max(id) FROM table WHERE date = '2010-10-10' will only read data files that are known to contain only tuples whose date value is the one specified in the query.
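
Continuing the partitioned example_table sketch from the previous section, you can observe the pruning in the query plan:

# The filter on the partition column means Spark reads only the date=2010-10-10 partition.
pruned = sqlContext.sql("SELECT max(id) AS max_id FROM example_table WHERE date = '2010-10-10'")
pruned.explain()   # the physical plan should show a scan limited to the matching partition
display(pruned)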