Photon

Preview

This feature is in Public Preview.

Photon is the native vectorized query engine on Databricks, written to be directly compatible with Apache Spark APIs so that it works with your existing code. It is developed in C++ to take advantage of modern hardware and uses the latest techniques in vectorized query processing to exploit data-level and instruction-level parallelism in CPUs, improving performance on real-world data and applications, all natively on your data lake. Photon is part of a high-performance runtime that runs your existing SQL and DataFrame API calls faster and reduces your total cost per workload.

How you enable Photon depends on whether you use Databricks clusters or Databricks SQL endpoints.

Databricks clusters

To access Photon on Databricks clusters, you must explicitly select a runtime that contains Photon when you create the cluster, using either the UI or the APIs (Clusters API and Jobs API; specify spark_version with a value such as 8.3.x-photon-scala2.12). Photon is available for clusters running the Photon variant of Databricks Runtime 8.3 and above.
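As a minimal sketch of the API route, the following builds a request body that could be sent to the Clusters API create endpoint. The field names cluster_name, spark_version, node_type_id, and num_workers are standard Clusters API fields; the cluster name, the node type, and the worker count are placeholder assumptions, and the node type in particular must be replaced with an instance type that Photon supports in your workspace. The helper names are hypothetical. The code only prints the payload and does not call the API.

```python
import json

# Runtime string using the syntax given in the text above.
PHOTON_RUNTIME = "8.3.x-photon-scala2.12"


def build_photon_cluster_spec(cluster_name, node_type_id, num_workers):
    """Return a Clusters API create-request body that selects a Photon runtime."""
    return {
        "cluster_name": cluster_name,
        "spark_version": PHOTON_RUNTIME,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
    }


def is_photon_runtime(spark_version):
    """Return True if the spark_version string names a Photon runtime variant."""
    return "-photon-" in spark_version


# Placeholder values (assumptions): substitute your own name, node type, and size.
spec = build_photon_cluster_spec("photon-demo", "i3.xlarge", 2)
print(json.dumps(spec, indent=2))
```

A check such as is_photon_runtime can be useful in job configuration code to confirm that a cluster definition really requests the Photon variant before it is submitted.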

Photon supports a limited set of instance types on the driver and worker nodes. Photon instance types consume DBUs at a different rate than the same instance type running the non-Photon runtime. For more information about Photon instances and DBU consumption, see the Databricks pricing page.

Databricks SQL endpoints

Photon is enabled by default in Databricks SQL endpoints. You can confirm that Photon is enabled for a SQL endpoint by clicking Endpoints in the sidebar, selecting the endpoint, and checking that the value for Photon is On.

Advantages

The following summarizes the advantages of Photon:

  • Supports SQL and equivalent DataFrame operations against Delta and Parquet tables.
  • Expected to accelerate queries that process a significant amount of data (100 GB or more) and include aggregations and joins.
  • Expected to accelerate queries on data that is accessed repeatedly and is likely to be in the Delta Lake cache.
  • More robust scan performance on tables with many columns and many small files.
  • Faster Delta and Parquet writing using UPDATE, DELETE, MERGE INTO, INSERT, and CREATE TABLE AS SELECT, especially for wide tables (hundreds to thousands of columns).
  • Replaces sort-merge joins with hash joins.

Limitations

  • Works only on Delta and Parquet tables, for both reads and writes.
  • Does not support the following data types:
    • Map
    • Array
  • Does not support window and sort operators.
  • Does not support Spark Structured Streaming.
  • Does not support UDFs.
  • Not expected to improve operations bottlenecked by network or scan I/O.
  • Not expected to improve short-running queries (under 2 seconds), for example, queries against small data.

Features not supported by Photon run the same way they would with Databricks Runtime; there is no performance advantage for those features.
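The limitations above can be sketched as a simple pre-flight check that lists the reasons a workload would fall back to the standard runtime path. This is only an illustration of the documented rules, not a Databricks API: the Workload class, its fields, and the photon_fallback_reasons function are all hypothetical names, and the check knows nothing about the actual query plan.

```python
from dataclasses import dataclass, field

# Values taken from the limitations listed above.
SUPPORTED_FORMATS = {"delta", "parquet"}
UNSUPPORTED_TYPES = {"map", "array"}


@dataclass
class Workload:
    """Hypothetical description of a workload, for illustration only."""
    table_format: str
    column_types: set = field(default_factory=set)
    uses_window_or_sort: bool = False
    uses_streaming: bool = False
    uses_udfs: bool = False


def photon_fallback_reasons(workload):
    """Return the documented limitations that apply to this workload."""
    reasons = []
    if workload.table_format.lower() not in SUPPORTED_FORMATS:
        reasons.append("table format is not Delta or Parquet")
    bad_types = {t.lower() for t in workload.column_types} & UNSUPPORTED_TYPES
    if bad_types:
        reasons.append("unsupported data types: " + ", ".join(sorted(bad_types)))
    if workload.uses_window_or_sort:
        reasons.append("uses window or sort operators")
    if workload.uses_streaming:
        reasons.append("uses Spark Structured Streaming")
    if workload.uses_udfs:
        reasons.append("uses UDFs")
    return reasons


clean = Workload(table_format="delta", column_types={"string", "bigint"})
mixed = Workload(table_format="csv", column_types={"map"}, uses_udfs=True)
print(photon_fallback_reasons(clean))
print(photon_fallback_reasons(mixed))
```

An empty list means none of the listed limitations apply; it does not by itself guarantee a speedup, since short-running or I/O-bound queries are not expected to improve.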