Databricks SQL Partitions

A partition constitutes a subset of rows in a table sharing the same value for a predefined subset of columns called the partitioning columns. Using partitions can speed up queries against the table as well as data manipulation.

To exploit partitioning you first define the set of partitioning column when you create a table using the PARTITIONED BY clause.

When inserting or manipulating rows in a table Databricks SQL automatically dispatches rows into the appropriate partitions.

You can also specify the partition directly using a PARTITION clause.

This syntax is also available for non Delta Lake tables to DROP, ADD or RENAME partitions quickly using the ALTER TABLE statement.

PARTITIONED BY

The PARTITIONED BY clause specified a list of columns along which the new table is partitioned.

Syntax

PARTITIONED BY ( { partition_column [ column_type ] } [, ...] )

Parameters

  • partition_column

    An identifier which may reference a column_identifier in the table. If you specify more than one column there must be no duplicates. You cannot reference all columns in the table’s column_specification.

  • column_type

    Unless the partition_column refers to a column_identifier in the table’s column_specification, column_type defines the data type of the partition_column.

    Not all data types supported by Databricks SQL are supported by all data sources.

Notes

Unless you define a Delta Lake table partitioning columns referencing the columns in the column specification are always moved to the end of the table.

PARTITION

You use the PARTITION clause to identify a partition to be queried or manipulated.

To name an individual partition all columns identified in the PARTITION BY clause must be named and associated with value. But you need not specify them in a specific order.

Unless you are adding a new partition to an existing table you may omit columns, or values to indicate that the operation applies to the all matching partitions matching the subset of columns.

PARTITION ( { partition_column  [ = partition_value ] } [ , ... ] )

Parameters

  • partition_column

    An column named as a partition column of the table. You may not specify the same column twice.

  • partition_value

    A literal of a data type matching the type of the partition column. If you omit a partition value the specification will match all values for this partition column.

Examples

-- Use the PARTTIONED BY clause in a table definition
> CREATE TABLE student(university STRING,
                       major      STRING,
                       name       STRING)
         PARTITIONED BY(university, major)

> CREATE TABLE professor(name STRING)
         PARTITIONED BY(university STRING,
                        department STRING);

-- Use the PARTITION specification to INSERT into a table
> INSERT INTO student
         PARTITION(university= 'TU Kaiserslautern') (major, name)
         SELECT major, name FROM freshmen;

-- Use the partition specification to add and drop a partition
> CREATE TABLE log(date DATE, id INT, event STRING)
     USING CSV LOCATION 'dbfs:/log'
     PARTITIONED BY (date);

> ALTER TABLE log ADD PARTIITON(date = DATE'2021-09-10');

> ALTER TABLE log DROP PARTITION(date = DATE'2021-09-10');

-- Drop all partitions from the named university, independent of the major.
> ALTER TABLE student DROP PARTITION(university = 'TU Kaiserslautern');