Update jobs when you upgrade legacy workspaces to Unity Catalog
When you upgrade legacy workspaces to Unity Catalog, you might need to update existing jobs to reference upgraded tables and file paths. The following table lists typical scenarios and suggestions for updating your jobs.
Scenario | Solution |
---|---|
Job uses a notebook that references custom libraries, either through an init script or through cluster-defined libraries. A custom library is a library that is not publicly available. | Modify the custom library so that it is compatible with Unity Catalog-enabled compute. |
Job uses a notebook that reads from or writes to a Hive metastore table. | Update the code to reference the upgraded table by its three-level Unity Catalog name (`catalog.schema.table`); see the table name example after this table. |
Job uses a notebook that reads from or writes to paths that are subfolders of tables. This is not possible in Unity Catalog. | Change the code to read from and write to the table itself (for example, by filtering on partition columns) rather than its subfolder paths. |
Job uses a notebook that reads from or writes to mount paths that are tables registered in Unity Catalog. | Change the code to reference the table by its three-level Unity Catalog name instead of the mount path; see the table name example after this table. |
Job uses a notebook that reads or writes files, not tables, via mount paths. | Change the code to read from and write to a volume location instead; see the volume path example after this table. |
Job is a streaming job that uses … | Not currently supported. Consider rewriting the job if possible, or do not attempt to refactor it until support is provided. |
Job is a streaming job that uses continuous processing mode. | Continuous processing mode is experimental in Spark and is not supported in Unity Catalog. Refactor the job to use structured streaming. If this is not possible, consider keeping the job running against Hive metastore. |
Job is a streaming job that uses checkpoint directories. | Update the checkpoint location to a path that Unity Catalog compute can access, such as a volume or an external location; see the volume path example after this table. |
Job has a cluster definition below Databricks Runtime 11.3. | Update the cluster definition to Databricks Runtime 11.3 LTS or above. |
Job has notebooks that interact with storage or tables. | The service principal that the job runs as must be granted read and write access to the required resources in Unity Catalog, such as volumes, tables, and external locations. |
Job is a Lakeflow Declarative Pipelines pipeline. | Update the pipeline so that it publishes to a Unity Catalog catalog and schema. |
Job has notebooks that use non-storage cloud services, such as AWS Kinesis, where the connection configuration uses an instance profile. | Consider using Unity Catalog service credentials to authenticate to the external service instead of an instance profile. |
Job uses Scala. | Scala is supported on shared clusters in Databricks Runtime 13.3 and above; on earlier runtimes, run the job on dedicated compute. |
Job has notebooks that use Scala UDFs. | Scala UDFs are supported on shared clusters in Databricks Runtime 14.2 and above; otherwise, run the job on dedicated compute. |
Job has tasks that use Databricks Runtime for Machine Learning (MLR). | Run the job on dedicated compute. |
Job has cluster configuration that relies on global init scripts. |  |
Job has cluster configuration or notebooks that use JARs/Maven packages, Spark extensions, or custom data sources (from Spark). | On shared clusters, JAR libraries and Maven coordinates must be added to the allowlist by a metastore admin before they can be installed (Databricks Runtime 13.3 LTS and above). |
Job has notebooks which use PySpark UDFs. | Use Databricks Runtime 13.2 or above. |
Job has notebooks with Python code that makes network calls. | Use Databricks Runtime 12.2 or above. |
Job has notebooks that use Pandas UDFs (scalar). | Use Databricks Runtime 13.2 or above. |
Job will use Unity Catalog Volumes. | Use Databricks Runtime 13.3 or above. |
Job has notebooks that use … |  |
Job has notebooks that use the internal dbutils command context API, for example to retrieve a job ID, and runs on a shared cluster. | Use a supported replacement API instead; this requires Databricks Runtime 13.3 LTS or above. |
Job has notebooks that use PySpark's `spark.udf.registerJavaFunction` and runs on a shared cluster. |  |
Job has notebooks that use RDD APIs, such as `sc.parallelize` and `spark.read.json()`, to convert a JSON object into a DataFrame, and runs on a shared cluster. | The RDD API is not supported on shared clusters. Build the DataFrame directly with `spark.createDataFrame()`; see the JSON sketch after this table. |
Job has notebooks that create an empty DataFrame via `sc.emptyRDD()` and runs on a shared cluster. | Pass an empty list and an explicit schema to `spark.createDataFrame()` instead; see the empty DataFrame sketch after this table. |
Job has notebooks that use RDD `mapPartitions` (expensive initialization logic plus cheaper operations per row) and runs on a shared cluster. | Rewrite the logic with the DataFrame API, for example `mapInPandas()`, which still lets you run the expensive initialization once per partition; see the mapInPandas sketch after this table. |
Job has notebooks that use `SparkContext` (`sc`) and `sqlContext` and runs on a shared cluster. | These entry points are not available on shared clusters. Use the `spark` SparkSession instead; see the SparkSession sketch after this table. |
Job has notebooks that read Spark configuration via `sparkContext.getConf` and runs on a shared cluster. | Use `spark.conf.get()` to read configuration values instead; see the SparkSession sketch after this table. |
Job has notebooks that use `SparkContext.setJobDescription()` and runs on a shared cluster. |  |
Job has notebooks that set Spark log levels with commands such as `sc.setLogLevel("INFO")` and runs on a shared cluster. |  |
Job has notebooks that use deeply nested expressions or queries and runs on a shared cluster. |  |
Job has notebooks that use `input_file_name()` in the code and runs on a shared cluster. | Use the `_metadata.file_path` column instead of `input_file_name()`; see the `_metadata` sketch after this table. |
Job has notebooks that perform data operations on DBFS file systems and runs on a shared cluster. | Migrate the files to Unity Catalog volumes or external locations and update the paths in the code; see the volume path example after this table. |
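The sketch below shows how a Hive metastore table reference, or a mount path that now backs a Unity Catalog table, can be changed to a three-level name. The catalog, schema, and table names (`main.sales.orders`, `main.sales.orders_history`) and the mount path are hypothetical placeholders; substitute the names created by your upgrade.

```python
# Before: two-level Hive metastore reference, or a mount path backed by the same table
orders = spark.table("hive_metastore.sales.orders")
orders = spark.read.format("delta").load("/mnt/sales/orders")

# After: read and write the upgraded table through its three-level Unity Catalog name
orders = spark.table("main.sales.orders")
orders.write.mode("append").saveAsTable("main.sales.orders_history")
```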
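For jobs that read or write files (rather than tables) through mount points or DBFS paths, including streaming checkpoint directories, a minimal sketch of the path change follows. The mount and volume paths, table names, and schema are hypothetical, and the sketch assumes the volume already exists and that the job's principal has the required privileges on it.

```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
])

# Before: files and a streaming checkpoint addressed through mount points
events = spark.readStream.schema(event_schema).json("/mnt/landing/events/")
query = (events.writeStream
         .option("checkpointLocation", "/mnt/checkpoints/events")
         .toTable("events_bronze"))

# After: the same files and checkpoint stored in Unity Catalog volumes,
# writing to the table by its three-level name
events = spark.readStream.schema(event_schema).json("/Volumes/main/raw/landing/events/")
query = (events.writeStream
         .option("checkpointLocation", "/Volumes/main/raw/checkpoints/events")
         .toTable("main.raw.events_bronze"))
```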
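For notebooks that turn an in-memory JSON object into a DataFrame by way of `sc.parallelize()`, one possible rewrite is sketched below. It assumes the payload is a single JSON object held in a Python string; adapt it if you have a list of records.

```python
import json
from pyspark.sql import Row

json_payload = '{"id": 1, "name": "example"}'   # hypothetical payload

# Before: relies on the RDD API, which is not available on shared clusters
df = spark.read.json(sc.parallelize([json_payload]))

# After: parse the JSON in Python and build the DataFrame directly
df = spark.createDataFrame([Row(**json.loads(json_payload))])
```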
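For notebooks that build an empty DataFrame from `sc.emptyRDD()`, passing an empty list and an explicit schema to `spark.createDataFrame()` produces the same result. The schema below is a hypothetical example.

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
])

# Before: depends on the RDD API
df = spark.createDataFrame(sc.emptyRDD(), schema)

# After: an empty list with an explicit schema produces the same empty DataFrame
df = spark.createDataFrame([], schema)
```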
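For `mapPartitions()` logic that pairs an expensive per-partition initialization with cheap per-row work, `mapInPandas()` keeps the same shape on a DataFrame: the function is called once per partition with an iterator of pandas batches, so the initialization still runs once. The lookup-table setup, column names, and input DataFrame below are hypothetical stand-ins.

```python
def score_partition(batches):
    # Expensive one-time setup per partition
    # (stand-in for loading a model, opening a connection, and so on)
    lookup = {i: i * 0.1 for i in range(1000)}
    for batch in batches:                        # each batch is a pandas DataFrame
        batch["score"] = batch["id"].map(lookup).fillna(0.0)   # cheap per-row work
        yield batch

df = spark.range(10)                             # hypothetical input DataFrame with an "id" column
scored = df.mapInPandas(score_partition, schema="id long, score double")
```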
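For notebooks that go through `sc`, `sqlContext`, or `sparkContext.getConf()`, the equivalent SparkSession calls are sketched below; the configuration key is an arbitrary example.

```python
# Before: SparkContext and SQLContext entry points, which are not available on shared clusters
df = sqlContext.sql("SELECT 1 AS one")
shuffle_partitions = sc.getConf().get("spark.sql.shuffle.partitions")

# After: the spark SparkSession covers the same calls
df = spark.sql("SELECT 1 AS one")
shuffle_partitions = spark.conf.get("spark.sql.shuffle.partitions")
```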
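For notebooks that call `input_file_name()`, the `_metadata` column carries the same file path on Unity Catalog compute. The source path below is a hypothetical volume location.

```python
from pyspark.sql.functions import col, input_file_name

src = "/Volumes/main/raw/landing/events/"        # hypothetical source path

# Before: input_file_name() is not supported on Unity Catalog compute
df = (spark.read.format("json").load(src)
      .withColumn("source_file", input_file_name()))

# After: read the file path from the _metadata column instead
df = (spark.read.format("json").load(src)
      .withColumn("source_file", col("_metadata.file_path")))
```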