Load data using a Unity Catalog external location
Preview
This feature is in Public Preview.
This article describes how to use the add data UI to create a managed table from data in Amazon S3 using a Unity Catalog external location. An external location is an object that combines a cloud storage path with a storage credential that authorizes access to the cloud storage path.
Databricks recommends using Unity Catalog external locations to access data in cloud object storage. The legacy S3 table import page only supports creating tables in the legacy Hive metastore and requires that you select a compute resource that uses an instance profile.
Before you begin
Before you begin, you must have the following:
A workspace with Unity Catalog enabled. For more information, see Set up and manage Unity Catalog.
The
READ FILES
privilege on the external location. For more information, see Create an external location to connect cloud storage to Databricks.The
CREATE TABLE
privilege on the schema in which you want to create the managed table, theUSE SCHEMA
privilege on the schema, and theUSE CATALOG
privilege on the parent catalog. For more information, see Unity Catalog privileges and securable objects.
Step 1: Confirm access to the external location
To confirm access to the external location, do the following:
In the sidebar of your Databricks workspace, click Catalog.
In Catalog Explorer, click External Data > External Locations.
Step 2: Create the managed table
To create the managed table, do the following:
In the sidebar of your workspace, click + New > Add data.
In the add data UI, click Amazon S3.
Select an external location from the drop-down list.
Select the folders and the files that you want to load into Databricks, and then click Preview table.
Select a catalog and a schema from the drop-down lists.
(Optional) Edit the table name.
(Optional) To set advanced format options by file type, click Advanced attributes, turn off Automatically detect file type, and then select a file type.
For a list of format options, see the following section.
(Optional) To edit the column name, click the input box at the top of the column.
Column names don’t support commas, backslashes, or unicode characters (such as emojis).
(Optional) To edit column types, click the icon with the type.
Click Create table.
File type format options
The following format options are available, depending on the file type:
Format option |
Description |
Supported file types |
---|---|---|
|
The separator character between columns. Only a single character is allowed, and backslash is not supported. The default is a comma. |
CSV |
|
The escape character to use when parsing the data. The default is a quotation mark. |
CSV |
|
This option specifies whether the file contains a header. Enabled by default. |
CSV |
|
Automatically detect file type. Default is |
XML |
|
Automatically detect column types from file content. You can edit types in the preview table. If this is set to false, all column types are inferred as STRING. Enabled by default. |
|
|
Whether a column’s value can span multiple lines in the file. Disabled by default. |
|
|
Whether to infer the schema across multiple files and to merge the schema of each file. Enabled by default. |
CSV |
|
Whether comments are allowed in the file. Enabled by default. |
JSON |
|
Whether single quotes are allowed in the file. Enabled by default. |
JSON |
|
Whether to try to infer timestamp strings as Enabled by default. |
JSON |
|
Whether to save columns that don’t match the schema. For more information, see What is the rescued data column?. Enabled by default. |
|
|
Whether to exclude attributes in elements. Default is |
XML |
|
The prefix for attributes to differentiate attributes and elements. Default is |
XML |
Column data types
The following column data types are supported. For more information about individual data types see SQL data types.
Data Type |
Description |
---|---|
|
8-byte signed integer numbers. |
|
Boolean ( |
|
and day, without a time-zone. |
|
Numbers with maximum precision |
|
8-byte double-precision floating point numbers. |
|
Character string values. |
|
Values comprising values of fields year, month, day, hour, minute, and second, with the session local timezone. |
Known issues
You might experience issues with special characters in complex data types, such as a JSON object with a key containing a backtick or a colon.
Some JSON files might require that you manually select JSON for the file type. To manually select a file type after you select files, click Advanced attributes, turn off Automatically detect file type, and then select JSON.
Nested timestamps and decimals inside complex types might encounter issues.