Problem: Create Table in Overwrite Mode Fails When Interrupted

Problem

When you cancel a running Apache Spark write operation and then rerun it, the following error occurs:

Error: org.apache.spark.sql.AnalysisException: Cannot create the managed table('`testdb`.`testtable`').
The associated location ('dbfs:/user/hive/warehouse/testdb.db/metastore_cache_testtable') already exists.;

Version

This problem can occur in Databricks Runtime 5.0 and above.

Cause

This problem is due to a change in the default behavior of Spark in version 2.4.

This problem can occur if:

  • The cluster is terminated while a write operation is in progress.
  • A temporary network issue occurs.
  • The job is interrupted.

Once the metastore data for a particular table is corrupted, it is hard to recover except by manually dropping the files in that location. The underlying problem is that a metadata directory called _STARTED is not deleted automatically when Databricks tries to overwrite it.
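To confirm the cause, you can list the table location and look for the leftover directory. This is a minimal sketch that assumes the warehouse path from the error message above; adjust the path to match your own error:

    // List the database directory reported in the error message (example path).
    // A leftover metastore_cache_<table> or _STARTED entry indicates stale metadata.
    display(dbutils.fs.ls("dbfs:/user/hive/warehouse/testdb.db/"))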

You can reproduce the problem by following these steps:

  1. Create a DataFrame:

    val df = spark.range(1000)

  2. Write the DataFrame to a location in overwrite mode:

    import org.apache.spark.sql.SaveMode

    df.write.mode(SaveMode.Overwrite).saveAsTable("testdb.testtable")

  3. Cancel the command while it is executing.

  4. Re-run the write command.

Solution

Set the flag spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation to true. Setting this flag deletes the leftover _STARTED directory and returns the process to its original state. For example, you can set it in the notebook:

spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")
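After setting the flag in the same notebook session, you can rerun the interrupted write. The sketch below reuses the table from the reproduction steps above:

    import org.apache.spark.sql.SaveMode

    // Allow Spark to create the managed table even though its location is not empty.
    spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation", "true")

    // Rerun the interrupted write; the leftover directory no longer blocks the overwrite.
    spark.range(1000).write.mode(SaveMode.Overwrite).saveAsTable("testdb.testtable")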

Or you can set it as a cluster-level Spark configuration:

spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation true

Another option is to manually clean up the data directory specified in the error message.
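If you choose manual cleanup, a minimal sketch (assuming the location reported in the error message above; substitute the path from your own error) is to drop the table and remove the leftover directory before rerunning the write:

    // Drop the corrupted table definition if it still exists in the metastore.
    spark.sql("DROP TABLE IF EXISTS testdb.testtable")

    // Recursively remove the leftover location reported in the error message.
    dbutils.fs.rm("dbfs:/user/hive/warehouse/testdb.db/metastore_cache_testtable", true)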