Apache Hive

Spark SQL in Databricks is designed to be compatible with the Apache Hive, including metastore connectivity, SerDes and UDFs.

SerDes and UDFs

Currently Hive SerDes and UDFs are based on Hive 1.2.1.

Metastore Connectivity

Please see External Hive Metastore for information on how to connect Databricks to an externally hosted Hive metastore.

Supported Hive Features

Spark SQL supports the vast majority of Hive features, such as:

  • Hive query statements, including:
    • SELECT
    • GROUP BY
    • ORDER BY
    • CLUSTER BY
    • SORT BY
  • All Hive expressions, including:
    • Relational expressions (=, , ==, <>, <, >, >=, <=, etc)
    • Arithmetic expressions (+, -, *, /, %, etc)
    • Logical expressions (AND, &&, OR, ||, etc)
    • Complex type constructors
    • Mathematical expressions (sign, ln, cos, etc)
    • String expressions (instr, length, printf, etc)
  • User defined functions (UDF)
  • User defined aggregation functions (UDAF)
  • User defined serialization formats (SerDes)
  • Window functions
  • Joins
    • JOIN
    • {LEFT|RIGHT|FULL} OUTER JOIN
    • LEFT SEMI JOIN
    • CROSS JOIN
  • Unions
  • Sub-queries
    • SELECT col FROM ( SELECT a + b AS col from t1) t2
  • Sampling
  • Explain
  • Partitioned tables including dynamic partition insertion
  • View
  • Vast majority of DDL statements, including:
    • CREATE TABLE
    • CREATE TABLE AS SELECT
    • ALTER TABLE
  • Most Hive Data types, including:
    • TINYINT
    • SMALLINT
    • INT
    • BIGINT
    • BOOLEAN
    • FLOAT
    • DOUBLE
    • STRING
    • BINARY
    • TIMESTAMP
    • DATE
    • ARRAY<>
    • MAP<>
    • STRUCT<>

Unsupported Hive Functionality

Below is a list of Hive features that we don’t support yet. Most of these features are rarely used in Hive deployments.

Major Hive Features

  • Writing to bucketed table created by Hive.
  • ACID fine-grained updates

Esoteric Hive Features

  • Union type
  • Unique join
  • Column statistics collecting: Spark SQL does not piggyback scans to collect column statistics at the moment and only supports populating the sizeInBytes field of the hive metastore.

Hive Input/Output Formats

  • File format for CLI: For results showing back to the CLI, Spark SQL only supports TextOutputFormat.
  • Hadoop archive

Hive Optimizations

A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model. Others are slotted for future releases of Spark SQL.

  • Block level bitmap indexes and virtual columns (used to build indexes)
  • Automatically determine the number of reducers for joins and groupbys: Currently in Spark SQL, you need to control the degree of parallelism post-shuffle using “SET spark.sql.shuffle.partitions=[num_tasks];”.
  • Skew data flag: Spark SQL does not follow the skew data flags in Hive.
  • STREAMTABLE hint in join: Spark SQL does not follow the STREAMTABLE hint.
  • Merge multiple small files for query results: if the result output contains multiple small files, Hive can optionally merge the small files into fewer large files to avoid overflowing the HDFS metadata. Spark SQL does not support that.