SELECT

Retrieves result sets from one or more tables.

Syntax

[ WITH with_query [ , ... ] ]
select_statement [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select_statement, ... ]
  [ ORDER BY { expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ] } ]
  [ SORT BY { expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ] } ]
  [ CLUSTER BY { expression [ , ... ] } ]
  [ DISTRIBUTE BY { expression [, ... ] } ]
  [ WINDOW clause ]
  [ LIMIT clause ]

While select_statement is defined as

SELECT [ hints , ... ] [ ALL | DISTINCT ] { [ named_expression | regex_column_names ] [ , ... ] }
  FROM { from_item [ , ... ] }
  [ PIVOT clause ]
  [ LATERAL VIEW clause ] [ ... ]
  [ WHERE boolean_expression ]
  [ GROUP BY expression [ , ... ] ]
  [ HAVING boolean_expression ]
  [ QUALIFY boolean_expression ]

Parameters

  • with_query

    One or more common table expressions before the main query block.These table expressions can be referenced later in the FROM clause. This is useful to abstract out repeated subquery blocks in the FROM clause and improves readability of the query.

  • hints

    Hints help the Spark optimizer make better planning decisions. Spark supports hints that influence selection of join strategies and repartitioning of the data.

  • ALL

    Select all matching rows from the relation. Enabled by default.

  • DISTINCT

    Select all matching rows from the relation after removing duplicates in results.

  • named_expression

    An expression with an assigned name. Denotes a column expression.

    Syntax: expression [AS] [alias]

  • from_item

    A source of input for the query. One of the following:

    • Table identifier

      • [database_name.] table_name: A table name, optionally qualified with a database name.
      • delta.`<path-to-table>` : The location of an existing Delta table.

      You can specify a time travel version using TIMESTAMP AS OF, VERSION AS OF, or @ syntax after a table relation. See Query an older snapshot of a table (time travel) for details.

    • JOIN

      Combines two or more relations using a join.

    • FILE

      Query a file with a specified format directly with SQL.

    • Table-valued function (TVF)

      Invokes a table function.

    • Inline table

      Defines an inline table.

    • Subquery

    • [ LATERAL ] ( query )

      Computes a relation using a query.

      LATERAL

      Note

      Available in Databricks Runtime 9.0 and above.

      You can specify the LATERAL keyword which allows the query to reference columns in preceding from_items.

  • PIVOT

    Used for data perspective; you can get the aggregated values based on specific column value.

  • LATERAL VIEW

    Used in conjunction with generator functions such as EXPLODE, which generates a virtual table containing one or more rows. LATERAL VIEW applies the rows to each original output row.

  • WHERE

    Filters the result of the FROM clause based on the supplied predicates.

  • GROUP BY

    The expressions that are used to group the rows. This is used in conjunction with aggregate functions (MIN, MAX, COUNT, SUM, AVG) to group rows based on the grouping expressions and aggregate values in each group. When a FILTER clause is attached to an aggregate function, only the matching rows are passed to that function.

  • HAVING

    The predicates by which the rows produced by GROUP BY are filtered. The HAVING clause is used to filter rows after the grouping is performed. If you specify HAVING without GROUP BY, it indicates a GROUP BY without grouping expressions (global aggregate).

  • QUALIFY

    Note

    Available in Databricks Runtime 10.0 and above.

    The predicates that are used to filter the results of window functions. To use QUALIFY, at least one window function is required to be present in the SELECT list or the QUALIFY clause.

  • ORDER BY

    An ordering of the rows of the complete result set of the query. The output rows are ordered across the partitions. This parameter is mutually exclusive with SORT BY, CLUSTER BY, and DISTRIBUTE BY and cannot be specified together.

  • SORT BY

    An ordering by which the rows are ordered within each partition. This parameter is mutually exclusive with ORDER BY and CLUSTER BY and cannot be specified together.

  • CLUSTER BY

    A set of expressions that is used to repartition and sort the rows. Using this clause has the same effect of using DISTRIBUTE BY and SORT BY together.

  • DISTRIBUTE BY

    A set of expressions by which the result rows are repartitioned. This parameter is mutually exclusive with ORDER BY and CLUSTER BY and cannot be specified together.

  • WINDOW

    Defines named window specifications that can be shared by multiple Window functions in the select_query.

  • LIMIT

    The maximum number of rows that can be returned by a statement or subquery. This clause is mostly used in the conjunction with ORDER BY to produce a deterministic result.

  • boolean_expression

    Any expression that evaluates to a result type Boolean. You can combine two or more expressions together using the logical operators ( AND, OR ).

  • expression

    A combination of one or more values, operators, and SQL functions that evaluates to a value.

Select on Delta table

In addition to the standard SELECT options, Delta tables support the time travel options described in this section. For details, see Query an older snapshot of a table (time travel).

AS OF syntax

SELECT * FROM table_identifier TIMESTAMP AS OF timestamp_expression
SELECT * FROM table_identifier VERSION AS OF version

where

  • timestamp_expression can be any one of:
    • '2018-10-18T22:15:12.013Z', that is, a string that can be cast to a timestamp
    • cast('2018-10-18 13:36:32 CEST' as timestamp)
    • '2018-10-18', that is, a date string
    • In Databricks Runtime 6.6 and above:
      • current_timestamp() - interval 12 hours
      • date_sub(current_date(), 1)
      • Any other expression that is or can be cast to a timestamp
  • version is a long value that can be obtained from the output of DESCRIBE HISTORY table_spec.

Neither timestamp_expression nor version can be subqueries.

Example

SELECT * FROM events TIMESTAMP AS OF '2018-10-18T22:15:12.013Z'
SELECT * FROM delta.`/mnt/delta/events` VERSION AS OF 123

@ syntax

Use the @ syntax to specify the timestamp or version. The timestamp must be in yyyyMMddHHmmssSSS format. You can specify a version after @ by prepending a v to the version. For example, to query version 123 for the table events, specify events@v123.

Example

SELECT * FROM events@20190101000000000
SELECT * FROM events@v123