Create tables

Preview

Unity Catalog is in Public Preview. To participate in the preview, contact your Databricks representative.

This article introduces the concept of managed and external tables and describes how to create tables in Unity Catalog.

Requirements

  • You must be a Databricks account admin or a user with the USAGE data permission on the parent catalog and the USAGE and CREATE data permissions on the parent schema that will contain the table.

  • Your Databricks account must be on the Premium plan or above.

  • You must have a Unity Catalog metastore linked to the workspace where you create the table.

  • The compute resource that you use to run the notebook, Databricks SQL editor, or Data Explorer workflow to create the table must be compliant with Unity Catalog security requirements.

  • You must have a parent catalog and schema to contain the table. See Create and manage catalogs and Create and manage schemas (databases), along with the permissions listed in the first requirement.

There are additional requirements for creating an external table. See Create an external table.

Managed tables

Managed tables are the default way to create tables in Unity Catalog. They are stored in the managed storage location that you configured when you created the metastore. To create a managed table with SQL, run a CREATE TABLE command without a LOCATION clause. To delete a managed table with SQL, use the DROP TABLE statement. When a managed table is dropped, its underlying data is deleted from your cloud tenant. The only format supported for managed tables is Delta.

Example SQL syntax:

CREATE TABLE <example-table>
  (id STRING, value STRING);
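
Example SQL syntax to drop the managed table, which also deletes its underlying data:

DROP TABLE <example-table>;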

External tables

External tables are tables whose data is stored in a location outside of the managed storage location, and they are not fully managed by Unity Catalog. When you run DROP TABLE on an external table, Unity Catalog does not delete the underlying data. You can manage privileges on external tables and use them in queries in the same way as managed tables. To create an external table with SQL, specify a LOCATION path in your CREATE TABLE statement. External tables can use the following file formats:

  • DELTA

  • CSV

  • JSON

  • AVRO

  • PARQUET

  • ORC

  • TEXT
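
For example, an external Delta table uses the same syntax as the managed table above plus a LOCATION clause. This is a sketch; the bucket path and directory are placeholders, and the required permissions are described in Create an external table:

CREATE TABLE <example-table>
  (id STRING, value STRING)
LOCATION 's3://<bucket-path>/<table-directory>';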

To manage access to the underlying cloud storage for an external table, Unity Catalog introduces two new object types: storage credentials and external locations.

To learn more, see Create an external table.

Create a managed table

To create a managed table, run the following SQL command. You can also use the example notebook to create a table. Replace the placeholder values:

  • <catalog_name>: The name of the catalog.

  • <schema_name>: The name of the schema.

  • <table_name>: A name for the table.

  • <column_specification>: The name and data type for each column.

CREATE TABLE <catalog_name>.<schema_name>.<table_name>
(
  <column_specification>
);
spark.sql("CREATE TABLE <catalog_name>.<schema_name>.<table_name> "
  "("
  "  <column_specification>"
  ")")
library(SparkR)

sql(paste("CREATE TABLE <catalog_name>.<schema_name>.<table_name> ",
  "(",
  "  <column_specification>",
  ")",
  sep = ""))
spark.sql("CREATE TABLE <catalog_name>.<schema_name>.<table_name> " +
  "(" +
  "  <column_specification>" +
  ")")

You can also create a managed table by using the Databricks Terraform provider and databricks_table. You can retrieve a list of table full names by using databricks_tables.

For example, to create the table main.default.department and insert five rows into it:

CREATE TABLE main.default.department
(
  deptcode  INT,
  deptname  STRING,
  location  STRING
);

INSERT INTO main.default.department VALUES
  (10, 'FINANCE', 'EDINBURGH'),
  (20, 'SOFTWARE', 'PADDINGTON'),
  (30, 'SALES', 'MAIDSTONE'),
  (40, 'MARKETING', 'DARLINGTON'),
  (50, 'ADMIN', 'BIRMINGHAM');
spark.sql("CREATE TABLE main.default.department "
  "("
  "  deptcode  INT,"
  "  deptname  STRING,"
  "  location  STRING"
  ")"
  "INSERT INTO main.default.department VALUES "
  "  (10, 'FINANCE', 'EDINBURGH'),"
  "  (20, 'SOFTWARE', 'PADDINGTON'),"
  "  (30, 'SALES', 'MAIDSTONE'),"
  "  (40, 'MARKETING', 'DARLINGTON'),"
  "  (50, 'ADMIN', 'BIRMINGHAM')")
library(SparkR)

sql(paste("CREATE TABLE main.default.department ",
  "(",
  "  deptcode  INT,",
  "  deptname  STRING,",
  "  location  STRING",
  ")",
  "INSERT INTO main.default.department VALUES ",
  "  (10, 'FINANCE', 'EDINBURGH'),",
  "  (20, 'SOFTWARE', 'PADDINGTON'),",
  "  (30, 'SALES', 'MAIDSTONE'),",
  "  (40, 'MARKETING', 'DARLINGTON'),",
  "  (50, 'ADMIN', 'BIRMINGHAM')",
  sep = ""))
spark.sql("CREATE TABLE main.default.department " +
  "(" +
  "  deptcode  INT," +
  "  deptname  STRING," +
  "  location  STRING" +
  ")" +
  "INSERT INTO main.default.department VALUES " +
  "  (10, 'FINANCE', 'EDINBURGH')," +
  "  (20, 'SOFTWARE', 'PADDINGTON')," +
  "  (30, 'SALES', 'MAIDSTONE')," +
  "  (40, 'MARKETING', 'DARLINGTON')," +
  "  (50, 'ADMIN', 'BIRMINGHAM')")

Example notebooks

You can use the following example notebooks to create a catalog, schema, and table, and to manage permissions on them.

Create and manage a table in Unity Catalog with SQL

Create and manage a table in Unity Catalog with Python

Create an external table

The data in an external table is stored in a path on your cloud tenant. To work with external tables, Unity Catalog introduces two new objects to access and work with external cloud storage:

  • A storage credential contains an authentication method for accessing a cloud storage location. The storage credential does not contain a mapping to the path to which it grants access. Storage credentials are access-controlled to determine which users can use the credential. To use a storage credential directly, add WITH (CREDENTIAL <credential_name>) to your SQL command.

  • An external location maps a storage credential to a cloud storage path to which it grants access. The external location grants access only to that cloud storage path and its contents. External locations are access-controlled to determine which users can use them. An external location is used automatically when your SQL command contains a LOCATION clause and no WITH (CREDENTIAL <credential_name>) clause.
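
For example, the following two statements create the same external table. The first relies on an external location that covers the target path; the second names a storage credential explicitly. This is a sketch; all names are placeholders, and the full syntax is shown in Create an external table.

CREATE TABLE <catalog>.<schema>.<table_name> (id STRING)
LOCATION 's3://<bucket_path>/<table_directory>';

CREATE TABLE <catalog>.<schema>.<table_name> (id STRING)
LOCATION 's3://<bucket_path>/<table_directory>'
WITH (CREDENTIAL <storage_credential>);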

Requirements

  • To create an external table, you must have:

    • The CREATE TABLE privilege on an external location or storage credential that grants access to the LOCATION accessed by the external table.

    • The USAGE permission on the table’s parent catalog and schema.

    • The CREATE permission on the table’s parent schema.

External locations and storage credentials are stored in the top level of the metastore, rather than in a catalog. To create a storage credential or an external location, you must be the metastore admin or an account-level admin.

To create an external table, follow these high-level steps. You can also use an example notebook to create the storage credential, external location, and external table and manage permissions for them.

Create a storage credential

Note

In the Databricks SQL Data Explorer, you can create storage credentials and view their details, but you cannot modify, rotate, or delete them. For those operations, you can use the Databricks SQL editor or a notebook.

  1. In AWS, create an S3 bucket to hold the data for your external tables. Use a bucket that is not also used as the metastore's managed storage location. Make a note of the S3 bucket path, which starts with s3://.

    Important

    The bucket name cannot include dot notation (.). For more bucket naming guidance, see the AWS bucket naming rules.

    If you enable KMS encryption on the S3 bucket, make a note of the name of the KMS encryption key.

  2. In AWS, create an IAM policy in the same AWS account as the S3 bucket. In the following policy, replace the following values:

    • <BUCKET>: The name of the S3 bucket from the previous step.

    • <KMS_KEY>: The name of the KMS key that encrypts the S3 bucket contents, if encryption is enabled. If encryption is disabled, remove the KMS section of the IAM policy.

    • <AWS_ACCOUNT_ID>: The Account ID of the current AWS account (not your Databricks account).

    • <AWS_IAM_ROLE_NAME>: The name of the AWS IAM role that will be created in the next step.

    {
     "Version": "2012-10-17",
     "Statement": [
         {
             "Action": [
                 "s3:GetObject",
                 "s3:GetObjectVersion",
                 "s3:PutObject",
                 "s3:PutObjectAcl",
                 "s3:DeleteObject",
                 "s3:ListBucket",
                 "s3:GetBucketLocation"
             ],
             "Resource": [
                 "arn:aws:s3:::<BUCKET>/*",
                 "arn:aws:s3:::<BUCKET>"
             ],
             "Effect": "Allow"
         },
         {
             "Action": [
                 "kms:Decrypt",
                 "kms:Encrypt",
                 "kms:GenerateDataKey*"
             ],
             "Resource": [
                 "arn:aws:kms:<KMS_KEY>"
             ],
             "Effect": "Allow"
         },
         {
             "Action": [
                 "sts:AssumeRole"
             ],
             "Resource": [
                 "arn:aws:iam::<AWS_ACCOUNT_ID>:role/<AWS_IAM_ROLE_NAME>"
             ],
             "Effect": "Allow"
         }
       ]
    }
    

    Note

    If you need a more restrictive IAM policy for Unity Catalog, contact your Databricks representative for assistance.

  3. Create an IAM role that uses the IAM policy you created in the previous step.

    1. Set EC2 as the trusted entity. This is a placeholder; you replace the trust relationship in a later step.

    2. In the role’s Permissions tab, attach the IAM policy you just created.

    3. Set up a cross-account trust relationship so that Unity Catalog can assume the role to access the data in the bucket on behalf of Databricks users. To do this, paste the following policy JSON into the Trust Relationship tab.

      • Do not modify the role ARN in the Principal section, which is a static value that references a role created by Databricks.

      • In the sts:ExternalId section, replace <DATABRICKS_ACCOUNT_ID> with your Databricks account ID (not your AWS account ID).

      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Principal": {
              "AWS": "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
              "StringEquals": {
                "sts:ExternalId": "<DATABRICKS_ACCOUNT_ID>"
              }
            }
          }
        ]
      }
      
  4. In Databricks, log in to a workspace that is linked to the metastore.

  5. From the persona switcher, select SQL.

  6. Click Data.

  7. At the bottom of the screen, click Storage Credentials.

  8. Click Create credential.

  9. Enter a name for the credential, the IAM Role ARN that authorizes Unity Catalog to access the storage location on your cloud tenant, and an optional comment.

  10. Click Save.

You can grant permissions directly on the storage credential, but Databricks recommends that you reference it in an external location and grant permissions to that instead. An external location combines a storage credential with a specific path, and authorizes access only to that path and its contents.

Manage permissions on a storage credential

  1. Log in to a workspace that is linked to the metastore.

  2. From the persona switcher, select SQL.

  3. Click Data.

  4. At the bottom of the screen, click Storage Credentials.

  5. Click the name of a storage credential to open its properties.

  6. Click Permissions.

  7. To grant permission to users or groups, select each identity, then click Grant.

  8. To revoke permissions from users or groups, deselect each identity, then click Revoke.
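
If you prefer SQL to the Data Explorer, you can manage these grants from the Databricks SQL editor or a notebook. The following is a sketch; the principal name is a placeholder, and privilege keywords may differ slightly between releases:

GRANT CREATE TABLE ON STORAGE CREDENTIAL <storage_credential> TO `<user_or_group>`;
REVOKE CREATE TABLE ON STORAGE CREDENTIAL <storage_credential> FROM `<user_or_group>`;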

Create an external location

  1. From the persona switcher, select SQL.

  2. Click Data.

  3. Click the + menu at the upper right and select Add an external location.

  4. Click Create location.

    1. Enter a name for the location.

    2. Optionally copy the S3 bucket path from an existing mount point.

    3. If you aren’t copying from an existing mount point, enter an S3 bucket path.

    4. Select the storage credential that grants access to the location.

    5. Click Save.
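
You can also create an external location from the Databricks SQL editor or a notebook. The following is a sketch; the names are placeholders, and the exact syntax may vary by release:

CREATE EXTERNAL LOCATION <location_name>
URL 's3://<bucket_path>'
WITH (STORAGE CREDENTIAL <storage_credential>);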

Manage permissions on an external location

  1. Log in to a workspace that is linked to the metastore.

  2. From the persona switcher, select SQL.

  3. Click Data.

  4. At the bottom of the screen, click External Locations.

  5. Click the name of an external location to open its properties.

  6. Click Permissions.

  7. To grant permission to users or groups, select each identity, then click Grant.

  8. To revoke permissions from users or groups, deselect each identity, then click Revoke.
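
Equivalent SQL grants look something like the following sketch, using the READ FILES and CREATE TABLE privileges discussed in this article (the principal name is a placeholder):

GRANT READ FILES, CREATE TABLE ON EXTERNAL LOCATION <location_name> TO `<user_or_group>`;
REVOKE READ FILES ON EXTERNAL LOCATION <location_name> FROM `<user_or_group>`;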

Create an external table

You can create an external table using an external location (recommended) or using a storage credential directly. In the following examples, replace the placeholder values:

  • <catalog>: The name of the catalog that will contain the table.

  • <schema>: The name of the schema that will contain the table.

  • <table_name>: A name for the table.

  • <column_specification>: The name and data type for each column.

  • <bucket_path>: The path on your cloud tenant where the table will be created.

  • <table_directory>: A directory where the table will be created. Use a unique directory for each table.

Important

Once a table is created in a path, users can no longer directly access the files in that path even if they have been given privileges on an external location or storage credential to do so. This is to ensure that users cannot circumvent access controls applied to tables by reading files from your cloud tenant directly.

Create an external table using an external location

To create an empty external table using an external location:

CREATE TABLE <catalog>.<schema>.<table_name>
(
  <column_specification>
)
LOCATION 's3://<bucket_path>/<table_directory>';
spark.sql("CREATE TABLE <catalog>.<schema>.<table_name> "
  "("
  "  <column_specification>"
  ") "
  "LOCATION 's3://<bucket_path>/<table_directory>'")
library(SparkR)

sql(paste("CREATE TABLE <catalog>.<schema>.<table_name> ",
  "(",
  "  <column_specification>",
  ") ",
  "LOCATION 's3://<bucket_path>/<table_directory>'",
  sep = ""))
spark.sql("CREATE TABLE <catalog>.<schema>.<table_name> " +
  "(" +
  "  <column_specification>" +
  ") " +
  "LOCATION 's3://<bucket_path>/<table_directory>'")

Unity Catalog checks that you have the following permissions:

  • CREATE TABLE on the external location that references the cloud storage path you specify.

  • CREATE on the parent schema.

  • USAGE on the parent catalog.

If so, the external table is created. Otherwise, an error occurs and the external table is not created.
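
If any of these grants are missing, the object owner or a metastore admin can add them with SQL, for example (a sketch; the principal and object names are placeholders):

GRANT USAGE ON CATALOG <catalog> TO `<user_or_group>`;
GRANT USAGE, CREATE ON SCHEMA <catalog>.<schema> TO `<user_or_group>`;
GRANT CREATE TABLE ON EXTERNAL LOCATION <location_name> TO `<user_or_group>`;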

Note

You can instead migrate an existing external table in the Hive metastore to Unity Catalog without duplicating its data. See Upgrade an external table to Unity Catalog.

You can also create an external table by using the Databricks Terraform provider and databricks_table. You can retrieve a list of table full names by using databricks_tables.

Create an external table using a storage credential directly

Note

Databricks recommends that you use external locations, rather than using storage credentials directly.

To create an external table using a storage credential, add a WITH (CREDENTIAL <credential_name>) clause to your SQL statement:

CREATE TABLE <catalog>.<schema>.<table_name>
(
  <column_specification>
)
LOCATION 's3://<bucket_path>/<table_directory>'
WITH (CREDENTIAL <storage_credential>);
spark.sql("CREATE TABLE <catalog>.<schema>.<table_name> "
  "( "
  "  <column_specification> "
  ") "
  "LOCATION 's3://<bucket_path>/<table_directory>' "
  "WITH (CREDENTIAL <storage_credential>)")
library(SparkR)

sql(paste("CREATE TABLE <catalog>.<schema>.<table_name> ",
 "( ",
 "  <column_specification> ",
 ") ",
 "LOCATION 's3://<bucket_path>/<table_directory>' ",
 "WITH (CREDENTIAL <storage_credential>)",
 sep = ""))
spark.sql("CREATE TABLE <catalog>.<schema>.<table_name> " +
  "( " +
  "  <column_specification> " +
  ") " +
  "LOCATION 's3://<bucket_path>/<table_directory>' " +
  "WITH (CREDENTIAL <storage_credential>)")

Unity Catalog checks whether you have the CREATE TABLE permission on the storage credential you specify, and whether the storage credential authorizes reading from and writing to the location you specified in the LOCATION clause. If both conditions are met, the external table is created. Otherwise, an error occurs and the external table is not created.

Example notebook

Create and manage an external table in Unity Catalog

Create a table from files stored in your cloud tenant

You can populate a managed or external table with records from files stored in your cloud tenant. Unity Catalog reads the files at that location and inserts their contents into the table. In Unity Catalog, this is called path-based access.

Note

A storage path where you create an external table cannot also be used to read or write data files.

Explore the contents of the files

To explore data stored in an external location:

  1. List the files in a cloud storage path:

    LIST 's3://<path_to_files>';
    
    display(spark.sql("LIST 's3://<path_to_files>'"))
    
    library(SparkR)
    
    display(sql("LIST 's3://<path_to_files>'"))
    
    display(spark.sql("LIST 's3://<path_to_files>'"))
    

    If you have the READ FILES permission on the external location associated with the cloud storage path, a list of data files in that location is returned.

  2. Query the data in the files in a given path:

    SELECT * FROM <format>.`s3://<path_to_files>`;
    
    display(spark.read.load("s3:://<path_to_files>"))
    
    library(SparkR)
    
    display(loadDF("s3:://<path_to_files>"))
    
    display(spark.read.load("s3:://<path_to_files>"))
    

    To explore data using a storage credential directly:

    SELECT * FROM <format>.`s3://<path_to_files>`
    WITH (CREDENTIAL <storage_credential>);

    display(spark.sql("SELECT * FROM <format>.`s3://<path_to_files>` "
      "WITH (CREDENTIAL <storage_credential>)"))

    library(SparkR)

    display(sql(paste("SELECT * FROM <format>.`s3://<path_to_files>` ",
      "WITH (CREDENTIAL <storage_credential>)",
      sep = "")))

    display(spark.sql("SELECT * FROM <format>.`s3://<path_to_files>` " +
      "WITH (CREDENTIAL <storage_credential>)"))
    

Create a table from the files

Note

You can instead migrate an existing external table in the Hive metastore to Unity Catalog without duplicating its data. See Upgrade an external table to Unity Catalog.

  1. Create a new table and populate it with records from data files on your cloud tenant.

    Important

    • When you create a table using this method, the storage path is read only once, to prevent duplication of records. If you want to re-read the contents of the directory, you must drop and re-create the table. For an existing table, you can insert records from a storage path.

    • The bucket path where you create a table cannot also be used to read or write data files.

    • Only the files in the exact directory are read; the read is not recursive.

    • You must have the following permissions:

      • USAGE on the parent catalog and schema.

      • CREATE on the parent schema.

      • READ FILES on the external location associated with the bucket path where the files are located, or directly on the storage credential if you are not using an external location.

      • If you are creating an external table, you need CREATE TABLE on the bucket path where the table will be created.

    • To create a managed table and populate it with records from a bucket path:

      CREATE TABLE <catalog>.<schema>.<table_name>
      (
        <column_specification>
      )
      USING delta
      AS SELECT * FROM delta.`s3://<path_to_files>`;

      spark.sql("CREATE TABLE <catalog>.<schema>.<table_name> "
        "( "
        "  <column_specification> "
        ") "
        "USING delta "
        "AS SELECT * FROM delta.`s3://<path_to_files>`")

      library(SparkR)

      sql(paste("CREATE TABLE <catalog>.<schema>.<table_name> ",
        "( ",
        "  <column_specification> ",
        ") ",
        "USING delta ",
        "AS SELECT * FROM delta.`s3://<path_to_files>`",
        sep = ""))

      spark.sql("CREATE TABLE <catalog>.<schema>.<table_name> " +
        "( " +
        "  <column_specification> " +
        ") " +
        "USING delta " +
        "AS SELECT * FROM delta.`s3://<path_to_files>`")
      

      To use a storage credential, add WITH (CREDENTIAL <storage_credential>) to the command.

    • To create an external table and populate it with records from a bucket path, add a LOCATION clause:

      CREATE TABLE <catalog>.<schema>.<table_name>
      (
        <column_specification>
      )
      USING delta
      LOCATION 's3://<table_location>'
      AS SELECT * FROM <format>.`s3://<path_to_files>`;

      spark.sql("CREATE TABLE <catalog>.<schema>.<table_name> "
        "( "
        "  <column_specification> "
        ") "
        "USING delta "
        "LOCATION 's3://<table_location>' "
        "AS SELECT * FROM <format>.`s3://<path_to_files>`")

      library(SparkR)

      sql(paste("CREATE TABLE <catalog>.<schema>.<table_name> ",
        "( ",
        "  <column_specification> ",
        ") ",
        "USING delta ",
        "LOCATION 's3://<table_location>' ",
        "AS SELECT * FROM <format>.`s3://<path_to_files>`",
        sep = ""))

      spark.sql("CREATE TABLE <catalog>.<schema>.<table_name> " +
        "( " +
        "  <column_specification> " +
        ") " +
        "USING delta " +
        "LOCATION 's3://<table_location>' " +
        "AS SELECT * FROM <format>.`s3://<path_to_files>`")
      

      To use a storage credential directly, add WITH (CREDENTIAL <storage_credential>) to the command.

Insert records from a path into an existing table

To insert records from a bucket path into an existing table, use the COPY INTO command. In the following examples, replace the placeholder values:

  • <catalog>: The name of the table’s parent catalog.

  • <schema>: The name of the table’s parent schema.

  • <path_to_files>: The bucket path that contains the data files.

  • <format>: The format of the files.

  • <table_location>: The bucket path where the external table is located.

  • <storage_credential>: If you are using a storage credential directly, the name of the storage credential that authorizes reading from or writing to the bucket path.

Important

  • When you insert records into a table using this method, the bucket path you provide is read only once, to prevent duplication of records.

  • The bucket path where you create a table cannot also be used to read or write data files.

  • Only the files in the exact directory are read; the read is not recursive.

  • You must have the following permissions:

    • USAGE on the parent catalog and schema.

    • MODIFY on the table.

    • READ FILES on the external location associated with the bucket path where the files are located, or directly on the storage credential if you are not using an external location.

    • To insert records into an external table, you need CREATE TABLE on the bucket path where the table is located.

  • To insert records from files in a bucket path into a managed table, using an external location to read from the bucket path:

    COPY INTO <catalog>.<schema>.<table>
    FROM (
      SELECT *
      FROM 's3://<path_to_files>'
    )
    FILEFORMAT = <format>;
    
    spark.sql("COPY INTO <catalog>.<schema>.<table> "
      "FROM ( "
      "  SELECT * "
      "  FROM 's3://<path_to_files>' "
      ") "
      "FILEFORMAT = <format>")
    
    library(SparkR)
    
    sql(paste("COPY INTO <catalog>.<schema>.<table> ",
      "FROM ( ",
      "  SELECT * ",
      "  FROM 's3://<path_to_files>' ",
      ") ",
      "FILEFORMAT = <format>",
      sep = ""))
    
    spark.sql("COPY INTO <catalog>.<schema>.<table> " +
      "FROM ( " +
      "  SELECT * " +
      "  FROM 's3://<path_to_files>' " +
      ") " +
      "FILEFORMAT = <format>")
    

    To use a storage credential directly, add WITH (CREDENTIAL <storage_credential>) to the command.

  • To insert into an external table, add a LOCATION clause:

    COPY INTO <catalog>.<schema>.<table>
    LOCATION 's3://<table_location>'
    FROM (
      SELECT *
      FROM 's3://<path_to_files>'
    )
    FILEFORMAT = <format>;
    
    spark.sql("COPY INTO <catalog>.<schema>.<table> "
      "LOCATION `s3://<table_location>` "
      "FROM ( "
      "  SELECT * "
      "  FROM 's3://<path_to_files>' "
      ") "
      "FILEFORMAT = <format>")
    
    library(SparkR)
    
    sql(paste("COPY INTO <catalog>.<schema>.<table> ",
      "LOCATION `s3://<table_location>` ",
      "FROM ( ",
      "  SELECT * ",
      "  FROM 's3://<path_to_files>' ",
      ") ",
      "FILEFORMAT = <format>",
      sep = ""))
    
    spark.sql("COPY INTO <catalog>.<schema>.<table> " +
      "LOCATION `s3://<table_location>` " +
      "FROM ( " +
      "  SELECT * " +
      "  FROM 's3://<path_to_files>' " +
      ") " +
      "FILEFORMAT = <format>")
    

    To use a storage credential directly, add WITH (CREDENTIAL <storage_credential>) to the command.