This feature is in Public Preview.
StreamSets helps you manage and monitor your data flow throughout its lifecycle. StreamSets' native integration with Databricks and Delta Lake lets you pull data from various sources and manage your pipelines easily.
For a general demonstration of StreamSets, watch the following YouTube video (10 minutes).
Here are the steps for using StreamSets with Databricks.
StreamSets authenticates with Databricks using a Databricks personal access token. To generate a personal access token, follow the instructions in Generate a personal access token.
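To illustrate how a personal access token is presented to Databricks, the following minimal sketch builds (but does not send) a REST request that carries the token as a Bearer credential. The workspace URL and token value are hypothetical placeholders, and `build_databricks_request` is a helper name introduced here for illustration only; StreamSets handles this step for you once the token is entered in its configuration.

```python
import urllib.request


def build_databricks_request(workspace_url: str, token: str, api_path: str) -> urllib.request.Request:
    """Build a GET request authenticated with a personal access token.

    The request is only constructed here; nothing is sent over the network.
    """
    url = workspace_url.rstrip("/") + api_path
    request = urllib.request.Request(url, method="GET")
    # The personal access token is passed as a Bearer token in the
    # Authorization header.
    request.add_header("Authorization", f"Bearer {token}")
    return request


# Hypothetical placeholders: substitute your own workspace URL and token.
req = build_databricks_request(
    "https://example.cloud.databricks.com",
    "dapiEXAMPLETOKEN",
    "/api/2.0/clusters/list",
)
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` against a real workspace is a quick way to confirm that a newly generated token is valid before you configure StreamSets with it.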
StreamSets writes data to an S3 bucket, and the Databricks integration cluster reads data from that location. Therefore, the integration cluster requires secure access to the S3 bucket.
To access AWS resources, you can launch the Databricks integration cluster with an instance profile. The instance profile should have access to the staging S3 bucket and the target S3 bucket where you want to write the Delta tables. To create an instance profile and configure the integration cluster to use the role, follow the instructions in Secure access to S3 buckets using instance profiles.
As an alternative, you can use IAM credential passthrough, which enables user-specific access to S3 data from a shared cluster.
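As a rough illustration of what the instance profile's permissions might look like, the sketch below generates an IAM policy document that covers both the staging bucket and the target bucket. The bucket names are hypothetical, `build_bucket_policy` is a helper invented for this example, and the set of S3 actions is an assumption; follow the linked instance profile instructions for the authoritative policy.

```python
import json


def build_bucket_policy(staging_bucket: str, target_bucket: str) -> dict:
    """Return an IAM policy document granting access to both buckets.

    The chosen actions are an assumption for illustration, not the
    official policy from the Databricks documentation.
    """
    # De-duplicate in case the staging and target buckets are the same.
    buckets = sorted({staging_bucket, target_bucket})
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{b}" for b in buckets],
            },
            {
                # Object-level actions apply to keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{b}/*" for b in buckets],
            },
        ],
    }


# Hypothetical bucket names.
policy = build_bucket_policy("my-staging-bucket", "my-delta-bucket")
print(json.dumps(policy, indent=2))
```

Generating the policy from the two bucket names keeps the staging and target permissions in sync if either bucket changes.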
Set Cluster Mode to Standard.
Set Databricks Runtime Version to Runtime 6.3 or above.
Set the following Spark configuration properties:
spark.databricks.delta.optimizeWrite.enabled true
spark.databricks.delta.autoCompact.enabled true
Configure your cluster depending on your integration and scaling needs.
For cluster configuration details, see Configure clusters.
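If you prefer to define the integration cluster programmatically rather than through the UI, the settings above could be captured in a cluster specification along these lines. This is a sketch only: the field names reflect the Databricks Clusters API as I understand it and should be checked against the API reference, `build_cluster_spec` is a helper introduced for this example, and the cluster name, runtime version key, node type, and instance profile ARN are illustrative placeholders.

```python
import json
from typing import Optional


def build_cluster_spec(
    cluster_name: str,
    spark_version: str,
    node_type_id: str,
    num_workers: int,
    instance_profile_arn: Optional[str] = None,
) -> dict:
    """Assemble a cluster specification with the Delta write optimizations enabled."""
    spec = {
        "cluster_name": cluster_name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        # The two Spark configuration properties from the steps above.
        "spark_conf": {
            "spark.databricks.delta.optimizeWrite.enabled": "true",
            "spark.databricks.delta.autoCompact.enabled": "true",
        },
    }
    if instance_profile_arn:
        # Attach the instance profile that grants access to the S3 buckets.
        spec["aws_attributes"] = {"instance_profile_arn": instance_profile_arn}
    return spec


# All values below are hypothetical placeholders.
cluster_spec = build_cluster_spec(
    cluster_name="streamsets-integration",
    spark_version="6.3.x-scala2.11",
    node_type_id="i3.xlarge",
    num_workers=2,
    instance_profile_arn="arn:aws:iam::123456789012:instance-profile/example-profile",
)
print(json.dumps(cluster_spec, indent=2))
```

Adjust `num_workers` and `node_type_id` to match your integration and scaling needs.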
See Retrieve the connection details for the steps to obtain the JDBC URL and HTTP path.
To connect a Databricks cluster to StreamSets, you need the following JDBC/ODBC connection properties:
- JDBC URL
- HTTP Path
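The two properties are related: the HTTP path typically also appears inside the JDBC URL as the `httpPath` property. The sketch below splits a JDBC URL into its parts so you can cross-check the values you copied. The example URL is a made-up placeholder modeled on the `jdbc:spark://` format, and `parse_jdbc_url` is a helper written for this illustration; the real values come from your cluster's JDBC/ODBC settings.

```python
def parse_jdbc_url(jdbc_url: str) -> dict:
    """Split a jdbc:spark:// URL into host, port, schema, and properties."""
    prefix = "jdbc:spark://"
    if not jdbc_url.startswith(prefix):
        raise ValueError(f"Expected a URL starting with {prefix!r}")

    # Everything before the first ';' is host:port/schema; the rest is
    # a semicolon-separated list of key=value properties.
    location, _, property_text = jdbc_url[len(prefix):].partition(";")
    host_port, _, schema = location.partition("/")
    host, _, port = host_port.partition(":")

    properties = {}
    for item in property_text.split(";"):
        if not item:
            continue  # tolerate a trailing ';'
        key, _, value = item.partition("=")
        properties[key] = value

    return {
        "host": host,
        "port": int(port) if port else None,
        "schema": schema,
        "properties": properties,
    }


# Hypothetical URL; the host, path, and token are placeholders.
example_url = (
    "jdbc:spark://example.cloud.databricks.com:443/default;"
    "transportMode=http;ssl=1;"
    "httpPath=sql/protocolv1/o/0/0000-000000-example0;"
    "AuthMech=3;UID=token;PWD=<personal-access-token>"
)
details = parse_jdbc_url(example_url)
print(details["host"], details["properties"]["httpPath"])
```

If the `httpPath` value extracted from the URL differs from the HTTP path you copied separately, one of the two was likely taken from a different cluster.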
Register and start up StreamSets for Databricks on AWS.
Start with a sample pipeline or check out StreamSets solutions to learn how to build a pipeline that ingests data into Delta Lake.