Presto and Athena Compatibility Support for Delta Lake

Public Preview

This feature is in Public Preview in Databricks Runtime 5.5.

Presto and Athena support reading from external tables when the list of data files to process is read from a manifest file, which is a text file containing the list of data files to read for querying a table. When an external table is defined in the Hive metastore using manifest files, Presto and Athena use the list of files in the manifest rather than finding the files by directory listing. This article describes how to set up Presto and Athena to query Delta tables using manifest files, and the limitations of this compatibility support.

Query a Delta table from Presto or Athena

You can query a Delta table from Presto or Athena using the following steps.

Step 1: Generate manifests from a Delta table using Databricks Runtime

Using a cluster running Databricks Runtime 5.5 or above, run the following SQL command on a Delta table at location pathToDeltaTable:

GENERATE symlink_format_manifest FOR TABLE delta.`pathToDeltaTable`

This command generates manifest files at pathToDeltaTable/_symlink_format_manifest/.
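If you prefer to run this step from a notebook, the following is a minimal Python sketch of the same command; the table location s3://my-bucket/delta/events and the use of dbutils to inspect the output are illustrative assumptions, not part of the GENERATE command itself.

    # Hypothetical table location; replace with your own S3 path.
    table_path = "s3://my-bucket/delta/events"

    # Generate the manifest files for the Delta table at that location.
    spark.sql("GENERATE symlink_format_manifest FOR TABLE delta.`{0}`".format(table_path))

    # Optionally list the generated manifests; each manifest is a text file that
    # contains the paths of the data files Presto and Athena should read.
    display(dbutils.fs.ls(table_path + "/_symlink_format_manifest/"))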

Note

It is recommended that you define the Delta table in a location whose S3 path is known. That is, define the Delta table either with an S3 path or with a DBFS path (mounts allowed) whose underlying S3 path is known. This is necessary because Presto and Athena require S3 paths.
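If the table was defined with a DBFS mount path, you can look up the underlying S3 path needed by Presto and Athena with a sketch like the following; the mount point /mnt/delta is a hypothetical example.

    # List all DBFS mounts and print the source URI backing the hypothetical mount.
    mount_point = "/mnt/delta"
    for m in dbutils.fs.mounts():
        if m.mountPoint == mount_point:
            print(m.source)  # the S3 URI to use when defining the table for Presto or Athena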

Step 2: Configure Presto or Athena to read generated manifests

  1. Define a new table in the Hive metastore connected to Presto or Athena using the special format SymlinkTextInputFormat and the manifest location pathToDeltaTable/_symlink_format_manifest/:

    CREATE EXTERNAL TABLE mytable ( ... )   -- same schema as the Delta table
    PARTITIONED BY ( ... )  -- optional, must be the same as the Delta table
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION 'pathToDeltaTable/_symlink_format_manifest/'  -- location of the generated manifest
    

    The SymlinkTextInputFormat configures Presto or Athena to compute file splits for mytable by reading the manifest file instead of using a directory listing to find data files. A filled-in example of this definition appears after these steps.

    Important

    • The path in the table definition must be the actual S3 path. In other words, DBFS paths cannot be used because Presto and Athena understand only S3 paths, not DBFS paths.
    • This table definition cannot be used in a query in Databricks. It can be used only by Presto and Athena. See later sections to find out how to define tables for Databricks and Presto or Athena to interoperate in an integrated environment.
    • In Databricks Runtime 5.3, the manifest was located in pathToDeltaTable/_snapshot_manifest/. If you defined the table using a manifest at that old location, regenerate the manifest using Databricks Runtime 5.5 and redefine the table with the new location.

    The tool you use to run the table-definition command depends on whether Databricks and Presto or Athena use the same metastore.

    • Same metastore: If both Databricks and Presto or Athena use the same metastore, you can define the table using Databricks.
    • Different metastores: If Databricks and Presto or Athena use different metastores, you must define the table using other tools.
      • Athena: You can define the external table in Athena.
      • Presto: Presto does not support the syntax CREATE EXTERNAL TABLE ... STORED AS ..., so you must use another tool (for example, Spark or Hive) connected to the same metastore as Presto to create the table.
  2. If the Delta table is partitioned, run MSCK REPAIR TABLE mytable after generating the manifests to force the metastore (connected to Presto or Athena) to discover the partitions. This is needed because the manifest of a partitioned table is itself partitioned in the same directory structure as the table. Run this command using the same tool used to create the table. Furthermore, you should run this command:

    • After every manifest generation: This ensures that new partitions are visible as soon as the manifest files have been updated. However, running the repair too frequently can cause high load on the Hive metastore.
    • As frequently as new partitions are expected: For example, if a table is partitioned by date, you can run the repair once a day shortly after midnight, after the new partition has been created in the table and its corresponding manifest files have been generated.
  3. For Presto running in EMR, you may need additional configuration changes. Presto in EMR is configured to use EMRFS, which can lead to confusing errors like the following:

    com.facebook.presto.spi.PrestoException: Socket Factory class not found: java.lang.ClassNotFoundException: Class testingforemptydefaultvalue not found
    

    To fix this issue, configure Presto to use its own default file system instead of EMRFS, using the following steps:

    1. Open the config file /etc/presto/conf/catalog/hive.properties.

      sudo vi /etc/presto/conf/catalog/hive.properties
      
    2. Change the key hive.s3-file-system-type from EMRFS to PRESTO.

    3. Restart the Presto server.

    sudo stop presto-server
    sudo start presto-server
    

After the above configuration, you should be able to query the Delta table from Presto or Athena exactly as you would query a table defined without manifests.
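For concreteness, here is a filled-in sketch of the table definition and partition repair above. It assumes the same-metastore case so the commands can be run from Databricks, and a hypothetical Delta table of events partitioned by event_date at s3://my-bucket/delta/events; the table and column names are illustrative only.

    # Define the manifest-based table for Presto and Athena. This table must not be
    # queried from Databricks; it exists only so Presto and Athena can read the manifests.
    spark.sql("""
      CREATE EXTERNAL TABLE mytable (eventId STRING, eventType STRING)
      PARTITIONED BY (event_date STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
      STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
      LOCATION 's3://my-bucket/delta/events/_symlink_format_manifest/'
    """)

    # Because the hypothetical table is partitioned, make the metastore discover the
    # partitions present in the manifest directory structure.
    spark.sql("MSCK REPAIR TABLE mytable")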

Step 3: Update manifests when table data changes

When the data in the Delta table is updated, the manifests must be updated for Presto and Athena to read the updates. You can do this in either of the following ways.

  • Updating explicitly: After all the data updates have completed, run the GENERATE SQL command to update the manifests.

  • Updating automatically: A Delta table can be configured such that all write operations on the table will automatically update the manifests. To enable this automatic mode, set the corresponding table property using the following SQL command.

    ALTER TABLE delta.`pathToDeltaTable` SET TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true)
    

    To disable this automatic mode, set this property to false. In addition, for partitioned tables, you have to run MSCK REPAIR so that the metastore connected to Presto or Athena discovers any new partitions.

    Note

    After enabling automatic mode on a partitioned table, each write operation updates only manifests corresponding to the partitions that operation wrote to. This incremental update ensures that the overhead of manifest generation is low for write operations. However, this also means that if the manifests in other partitions are stale, enabling automatic mode will not automatically fix it. Therefore, Databricks recommends that you explicitly run GENERATE to update manifests for the entire table immediately after enabling automatic mode.

Whether to update explicitly or automatically depends on the concurrency of write operations on the Delta table and the desired data consistency. For example, if automatic mode is enabled, concurrent write operations lead to concurrent overwrites of the manifest files. With such unordered writes, the manifest files are not guaranteed to point to the latest version of the table after the write operations complete. Hence, if concurrent writes are expected and you want to avoid stale manifests, consider explicitly updating the manifests after the expected write operations have completed.
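As a minimal illustration of the automatic mode and the recommended follow-up, the following Python sketch reuses the hypothetical table location from the earlier examples.

    table_path = "s3://my-bucket/delta/events"  # hypothetical path

    # Turn on automatic manifest updates for every write to this table.
    spark.sql("ALTER TABLE delta.`{0}` SET TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true)".format(table_path))

    # As recommended above, regenerate the manifests for the whole table once so
    # that no partition is left with a stale manifest.
    spark.sql("GENERATE symlink_format_manifest FOR TABLE delta.`{0}`".format(table_path))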

Workflow with Databricks and Presto or Athena using the same Hive metastore

A common setup with Databricks and Presto or Athena is to have both of them configured to use the same Hive metastore.

For example, you can use Athena and Databricks integrated with AWS Glue.

Here is the recommended workflow for creating Delta tables, writing to them from Databricks, and querying them from Presto or Athena in such a configuration. All these steps can be executed in Databricks Runtime.

  1. Create the Delta table in Databricks using one of the following methods:

    • Write to a location and then create the table using that location (for example, dataframe.write.format("delta").save("pathToDeltaTable") followed by CREATE TABLE ... USING delta ... LOCATION "pathToDeltaTable").
    • Create the table directly while writing data (for example, dataframe.write.format("delta").saveAsTable(...) or CREATE TABLE ... AS SELECT ...).

    Call this table delta_table_for_db.

  2. Generate the manifests using the Delta.generateHiveManifest(...) method on the path pathToDeltaTable:

    import com.databricks.delta.Delta
    Delta.generateHiveManifest("pathToDeltaTable")
    

    Note the manifest location.

  3. Create another table only for Presto or Athena using the manifest location. Call this table delta_table_for_presto. If the table is partitioned, call MSCK REPAIR TABLE delta_table_for_presto.
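The following Python sketch runs this workflow end to end from Databricks, assuming a shared metastore and hypothetical names, columns, and paths; it uses the SQL GENERATE command from Step 1 of the previous section in place of the Scala helper, which produces the same manifests.

    path = "s3://my-bucket/delta/events"  # hypothetical table location

    # 1. Write the data and define delta_table_for_db on the data location.
    #    (dataframe is assumed to exist and to contain an event_date column.)
    dataframe.write.format("delta").partitionBy("event_date").save(path)
    spark.sql("CREATE TABLE delta_table_for_db USING delta LOCATION '{0}'".format(path))

    # 2. Generate the manifests for the table.
    spark.sql("GENERATE symlink_format_manifest FOR TABLE delta.`{0}`".format(path))

    # 3. Define delta_table_for_presto on the manifest location (read only by
    #    Presto or Athena, never by Databricks) and discover its partitions.
    spark.sql("""
      CREATE EXTERNAL TABLE delta_table_for_presto (eventId STRING, eventType STRING)
      PARTITIONED BY (event_date STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
      STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
      LOCATION '{0}/_symlink_format_manifest/'
    """.format(path))
    spark.sql("MSCK REPAIR TABLE delta_table_for_presto")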

There should be two tables defined on the same data:

  • delta_table_for_db: Defined on the data location. All read and write operations in Databricks must use this table. Presto and Athena cannot use this table for any query.
  • delta_table_for_presto: Defined on the manifest location. All read operations from Presto or Athena must use this table. Databricks cannot use this table for any operations.

Remember that any schema changes in the Delta table data will be visible to operations in Databricks using delta_table_for_db. However, the schema changes will not be visible to Presto or Athena using delta_table_for_presto until the table is redefined with the new schema.

Warning

Do not use AWS Glue Crawler on the location pathToDeltaTable to define the table in AWS Glue. Delta Lake maintains files corresponding to multiple versions of the table, and querying all the files crawled by Glue will generate incorrect results.

Limitations

Presto and Athena compatibility support has the following known limitations.

Data consistency

Whenever Delta Lake generates updated manifests, it atomically overwrites an existing manifest file. Therefore, Presto and Athena always see a consistent view of the data files; they see either all of the files of the old version or all of the files of the new version. However, the granularity of this consistency guarantee depends on whether the table is partitioned.

  • Unpartitioned tables: All the file names are written in one manifest file, which is updated atomically. In this case, Presto and Athena see full-table snapshot consistency.
  • Partitioned tables: The manifest files are partitioned in the same Hive-partitioning-style directory structure as the original Delta table. This means that each partition is updated atomically, and Presto or Athena sees a consistent view of each partition but not a consistent view across partitions. Furthermore, because the manifests of all partitions cannot be updated together, concurrent attempts to generate manifests can leave different partitions with manifests of different versions.

Performance

Very large numbers of files can hurt the performance of Presto and Athena. Hence, we recommend that you optimize the table before publishing the manifests. We suggest that the number of files should not exceed 1000 (for the entire unpartitioned table or for each partition in a partitioned table).
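A short sketch of this recommendation, assuming the Databricks Delta OPTIMIZE command and the hypothetical table location used in the earlier examples:

    table_path = "s3://my-bucket/delta/events"  # hypothetical path

    # Compact small files so each partition stays well under the suggested 1000 files.
    spark.sql("OPTIMIZE delta.`{0}`".format(table_path))

    # Regenerate the manifests so that Presto and Athena read the compacted files.
    spark.sql("GENERATE symlink_format_manifest FOR TABLE delta.`{0}`".format(table_path))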

Schema evolution

Delta supports schema evolution, and queries on a Delta table automatically use the latest schema regardless of the schema defined for the table in the Hive metastore. However, Presto and Athena use the schema defined in the Hive metastore and will not query with the updated schema until the table used by Presto or Athena is redefined to have the updated schema.