How Delta tables work
All new tables in Databricks are, by default, created as Delta tables. A Delta table stores data as a directory of files in cloud object storage and registers its metadata to the metastore within a catalog and schema. All Unity Catalog managed tables and streaming tables are Delta tables.
Delta tables contain rows of data that can be queried and updated using SQL, Python, and Scala APIs. Delta tables store metadata in the open source Delta Lake format. As a user, you can treat these tables much as you would tables in a database: you can insert, update, delete, and merge data into them. Databricks takes care of storing and organizing the data in a manner that supports efficient operations. Because the data is stored in the open Delta Lake format, you can read and write it from many other products besides Databricks.
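For example, once a Delta table exists you can change its rows in place. The following sketch assumes a hypothetical Delta table named catalog.schema.customer_orders with order_id and status columns; the statements are standard SQL run through PySpark and are not part of the taxi trips example that follows.

# Minimal sketch: row-level changes on a hypothetical Delta table
# named catalog.schema.customer_orders (order_id, status).

# Insert a new row.
spark.sql("INSERT INTO catalog.schema.customer_orders VALUES (1001, 'pending')")

# Update rows that match a condition.
spark.sql("""
    UPDATE catalog.schema.customer_orders
    SET status = 'shipped'
    WHERE order_id = 1001
""")

# Delete rows that match a condition.
spark.sql("""
    DELETE FROM catalog.schema.customer_orders
    WHERE status = 'cancelled'
""")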
While it is possible to create tables on Databricks that don’t use Delta Lake, those tables don’t provide the transactional guarantees or optimized performance of Delta tables. For more information about table types that use formats other than Delta Lake, see What is a table?.
The following sample code creates a Delta table from the samples.nyctaxi.trips sample dataset, filtering down to rows that contain a fare greater than $10. This table isn’t updated when new rows are added or updated in samples.nyctaxi.trips:
from pyspark.sql.functions import col

# Read the sample dataset and keep only rows with a fare greater than $10.
filtered_df = (
    spark.read.table("samples.nyctaxi.trips")
    .filter(col("fare_amount") > 10.0)
)
# Save the result as a Delta table (the default table format on Databricks).
filtered_df.write.saveAsTable("catalog.schema.filtered_taxi_trips")
You can now query this Delta table using languages like SQL or Python.
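As a sketch, assuming the catalog.schema.filtered_taxi_trips table created above exists and you have permission to read it, you could query it with PySpark like this:

# Read the Delta table created above and inspect a few rows.
trips_df = spark.read.table("catalog.schema.filtered_taxi_trips")
trips_df.select("fare_amount", "trip_distance").show(5)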
Delta tables and regular views
A view is the result of a query over one or more tables and views in Unity Catalog. You can create a view from tables and from other views in multiple schemas and catalogs.
A regular view is a query whose result is recomputed every time the view is queried. The primary benefit of a view is that it allows you to hide the complexity of the query from users, because they can query the view like a regular table. However, because regular views are recomputed every time a query runs, they can be expensive for complex queries or queries that process a lot of data.
The following diagram shows how regular views work.
The following sample code creates a regular view from the samples.nyctaxi.trips sample dataset, filtering down to rows that contain a fare greater than $10. This view always returns correct results even if new rows are added or existing rows are updated in samples.nyctaxi.trips:
# A view stores only the query definition; the filter is re-evaluated
# against samples.nyctaxi.trips every time the view is queried.
spark.sql("""
    CREATE OR REPLACE VIEW catalog.schema.v_filtered_taxi_trips AS
    SELECT *
    FROM samples.nyctaxi.trips
    WHERE fare_amount > 10.0
""")
You can now query this regular view using languages like SQL or Python.
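For instance, assuming the catalog.schema.v_filtered_taxi_trips view from the previous example exists, the following sketch queries it the same way you would query a table; the filter over samples.nyctaxi.trips runs again at query time:

# Query the view; the underlying filter is evaluated against the
# current contents of samples.nyctaxi.trips when this runs.
view_df = spark.read.table("catalog.schema.v_filtered_taxi_trips")
print(view_df.count())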