Configure data access for ingestion
This article describes how admin users can configure access to data in an Amazon S3 (S3) bucket so that Databricks users can load data from S3 into a table in Databricks.
This article describes the following ways to configure secure access to source data:
(Recommended) Create a Unity Catalog volume.
Create a Unity Catalog external location with a storage credential.
Launch a compute resource that uses an AWS instance profile.
Generate temporary credentials (an AWS access key ID, a secret key, and a session token).
Before you begin
Before you configure access to data in S3, make sure you have the following:
Data in an S3 bucket in your AWS account. To create a bucket, see Creating a bucket in the AWS documentation.
To access data using a Unity Catalog volume (recommended), the READ VOLUME privilege on the volume. For more information, see What are Unity Catalog volumes? and Unity Catalog privileges and securable objects.
To access data using a Unity Catalog external location, the READ FILES privilege on the external location. For more information, see Create an external location to connect cloud storage to Databricks. (Example GRANT statements for both privileges follow this list.)
To access data using a compute resource with an AWS instance profile, Databricks workspace admin permissions.
A Databricks SQL warehouse. To create a SQL warehouse, see Create a SQL warehouse.
Familiarity with the Databricks SQL user interface.
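If the privileges above have not been granted yet, an admin can grant them in Databricks SQL. The following is a minimal sketch; the catalog, schema, volume, external location, and user names are hypothetical placeholders, so substitute your own.
-- Grant read access on a Unity Catalog volume (recommended).
GRANT READ VOLUME ON VOLUME main.ingest.landing_volume TO `data.analyst@example.com`;
-- Grant read access on a Unity Catalog external location.
GRANT READ FILES ON EXTERNAL LOCATION my_s3_external_location TO `data.analyst@example.com`;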
Configure access to cloud storage
Use one of the following methods to configure access to S3. A Databricks SQL sketch of the volume-based approach follows the list.
(Recommended) Create a Unity Catalog volume. For more information, see What are Unity Catalog volumes?.
Configure a Unity Catalog external location with a storage credential. For more information about external locations, see Create an external location to connect cloud storage to Databricks.
Configure a compute resource to use an AWS instance profile. For more information, see Configure a SQL warehouse to use an instance profile.
Generate temporary credentials (an AWS access key ID, a secret key, and a session token) to share with other Databricks users. For more information, see Generate temporary credentials for ingestion.
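As an illustration of the volume-based approach, the following Databricks SQL sketch creates an external location on the S3 bucket and an external volume on top of it. It assumes a storage credential named my_s3_credential already exists; the bucket, catalog, schema, and object names are placeholders.
-- Assumes an existing storage credential named my_s3_credential.
CREATE EXTERNAL LOCATION IF NOT EXISTS my_s3_external_location
  URL 's3://my-ingest-bucket/landing'
  WITH (STORAGE CREDENTIAL my_s3_credential);
-- Create a volume backed by that location so users can browse and load its files.
CREATE EXTERNAL VOLUME IF NOT EXISTS main.ingest.landing_volume
  LOCATION 's3://my-ingest-bucket/landing';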
Clean up
If you no longer want to keep the associated resources in your cloud account and in Databricks, you can clean them up as follows.
Delete the AWS CLI named profile
In your ~/.aws/credentials file for Unix, Linux, and macOS, or in your %USERPROFILE%\.aws\credentials file for Windows, remove the following portion of the file, and then save the file:
[<named-profile>]
aws_access_key_id = <access-key-id>
aws_secret_access_key = <secret-access-key>
Delete the IAM user
Open the IAM console in your AWS account, typically at https://console.aws.amazon.com/iam.
In the sidebar, click Users.
Select the box next to the user, and then click Delete.
Enter the name of the user, and then click Delete.
Delete the IAM policy
Open the IAM console in your AWS account, if it is not already open, typically at https://console.aws.amazon.com/iam.
In the sidebar, click Policies.
Select the option next to the policy, and then click Actions > Delete.
Enter the name of the policy, and then click Delete.
Delete the S3 bucket
Open the Amazon S3 console in your AWS account, typically at https://console.aws.amazon.com/s3.
Select the option next to the bucket, and then click Empty.
Enter permanently delete, and then click Empty.
In the sidebar, click Buckets.
Select the option next to the bucket, and then click Delete.
Enter the name of the bucket, and then click Delete bucket.
Next steps
After you complete the steps in this article, users can run the COPY INTO command to load the data from the S3 bucket into your Databricks workspace. An example follows the list below.
To load data using a Unity Catalog volume or external location, see Load data using COPY INTO with Unity Catalog volumes or external locations.
To load data using a SQL warehouse with an AWS instance profile, see Load data using COPY INTO with an instance profile.
To load data using temporary credentials (an AWS access key ID, a secret key, and a session token), see Load data using COPY INTO with temporary credentials.
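For example, loading CSV files from a Unity Catalog volume might look like the following sketch. The table name, volume path, and format options are hypothetical; adjust them for your data.
-- Create an empty Delta table that COPY INTO can evolve the schema into.
CREATE TABLE IF NOT EXISTS main.ingest.sales_raw;
-- Load CSV files from the volume into the table.
COPY INTO main.ingest.sales_raw
  FROM '/Volumes/main/ingest/landing_volume/sales/'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true');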