Connect to StreamSets

Preview

StreamSets helps you manage and monitor your data flow throughout its lifecycle. StreamSets' native integration with Databricks and Delta Lake lets you pull data from various sources and manage your pipelines easily.

For a general demonstration of StreamSets, watch the following YouTube video (10 minutes).

Here are the steps for using StreamSets with Databricks.

Step 1: Generate a Databricks personal access token

StreamSets authenticates with Databricks using a Databricks personal access token.

Note

As a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use OAuth tokens.

If you use personal access token authentication, Databricks recommends using personal access tokens belonging to service principals instead of workspace users. To create tokens for service principals, see Manage tokens for a service principal.
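As a rough illustration of how a token is presented to Databricks, the sketch below builds (but does not send) a REST request that carries the token as a bearer credential. The workspace URL and token value are hypothetical placeholders, and the Clusters API list endpoint is used only as a convenient way to check that a token is accepted; StreamSets handles authentication itself once you supply the token.

```python
import urllib.request


def build_clusters_list_request(workspace_url: str, token: str) -> urllib.request.Request:
    """Build a GET request for the Clusters API list endpoint.

    The request is only constructed here, not sent, so this function
    never touches the network.
    """
    url = workspace_url.rstrip("/") + "/api/2.0/clusters/list"
    return urllib.request.Request(
        url,
        # A personal access token is passed as a bearer token.
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Hypothetical values: substitute your own workspace URL and token.
request = build_clusters_list_request(
    "https://example.cloud.databricks.com", "dapi-EXAMPLE-TOKEN"
)
print(request.full_url)
```

To send the request, pass it to `urllib.request.urlopen`; an HTTP 200 response suggests the token is valid for that workspace. In practice, read the token from a secret store or an environment variable rather than hard-coding it.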

Step 2: Set up a cluster to support integration needs

StreamSets writes data to an S3 bucket, and the Databricks integration cluster reads data from that location. The integration cluster therefore requires secure access to the S3 bucket.

Secure access to an S3 bucket

To access AWS resources, you can launch the Databricks integration cluster with an instance profile. The instance profile should have access to the staging S3 bucket and the target S3 bucket where you want to write the Delta tables. To create an instance profile and configure the integration cluster to use the role, follow the instructions in Tutorial: Configure S3 access with an instance profile.

As an alternative, you can use IAM credential passthrough, which enables user-specific access to S3 data from a shared cluster.
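To make the access requirement concrete, here is a minimal sketch that generates an IAM policy document covering both the staging bucket and the target bucket. The bucket names are hypothetical, and the list of S3 actions is an assumption about what a read/write integration typically needs; the linked tutorial remains the authoritative source for the exact policy to attach to the instance profile role.

```python
import json


def build_s3_policy(staging_bucket: str, target_bucket: str) -> dict:
    """Return an IAM policy document granting access to both buckets.

    Bucket-level actions apply to the bucket ARN; object-level actions
    apply to the objects inside it (the "/*" resource).
    """
    # Deduplicate in case staging and target are the same bucket.
    buckets = sorted({staging_bucket, target_bucket})
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{b}" for b in buckets],
            },
            {
                "Effect": "Allow",
                # Assumed action set; adjust to your security requirements.
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{b}/*" for b in buckets],
            },
        ],
    }


# Hypothetical bucket names, for illustration only.
policy = build_s3_policy("my-streamsets-staging", "my-delta-tables")
print(json.dumps(policy, indent=2))
```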

Specify the cluster configuration

Set Cluster Mode to Standard.
Set Databricks Runtime Version to Runtime: 6.3 or above.
Enable optimized writes and auto compaction by adding the following properties to your Spark configuration:
```ini
spark.databricks.delta.optimizeWrite.enabled true
spark.databricks.delta.autoCompact.enabled true
```
Configure your cluster depending on your integration and scaling needs.

For cluster configuration details, see Compute configuration reference.
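If you create the cluster programmatically rather than through the UI, the same settings can be expressed as a Clusters API request body. The sketch below only assembles that body; the cluster name, runtime version string, node type, worker count, and instance profile ARN are all hypothetical placeholders to replace with values valid in your workspace.

```python
import json

# The two Delta Lake properties from the Spark configuration above.
DELTA_SPARK_CONF = {
    "spark.databricks.delta.optimizeWrite.enabled": "true",
    "spark.databricks.delta.autoCompact.enabled": "true",
}


def build_cluster_spec(
    name: str,
    spark_version: str,
    node_type_id: str,
    num_workers: int,
    instance_profile_arn: str,
) -> dict:
    """Assemble a cluster specification for the integration cluster."""
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        # Copy so callers can modify the result without changing the constant.
        "spark_conf": dict(DELTA_SPARK_CONF),
        "aws_attributes": {"instance_profile_arn": instance_profile_arn},
    }


# All argument values below are placeholders, not recommendations.
spec = build_cluster_spec(
    name="streamsets-integration",
    spark_version="<runtime-version-key>",
    node_type_id="<node-type-id>",
    num_workers=2,
    instance_profile_arn="arn:aws:iam::<account-id>:instance-profile/<profile-name>",
)
print(json.dumps(spec, indent=2))
```

Keeping the Spark properties in one constant makes it harder for the optimized-write and auto-compaction settings to drift between environments.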

Step 3: Obtain JDBC and ODBC connection details to connect to a cluster

To connect a Databricks cluster to StreamSets, you need the following JDBC/ODBC connection properties:

JDBC URL
HTTP Path

To obtain these values, see Get connection details for a Databricks compute resource.
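The HTTP path also appears inside the JDBC URL as the `httpPath` property, so a quick way to sanity-check the two values is to parse the URL and compare. The sketch below assumes the semicolon-delimited `key=value` layout used by Databricks JDBC URLs; the sample URL is made up and its host, path, and credentials are placeholders.

```python
def parse_jdbc_url(jdbc_url: str) -> dict:
    """Split a JDBC URL into host, port, HTTP path, and other properties."""
    prefix, _, rest = jdbc_url.partition("://")
    if not prefix.startswith("jdbc:") or not rest:
        raise ValueError(f"not a JDBC URL: {jdbc_url!r}")

    # The first segment is host[:port][/schema]; the rest are key=value pairs.
    authority, *params = rest.split(";")
    host_port = authority.split("/", 1)[0]
    host, _, port = host_port.partition(":")

    properties = {}
    for param in params:
        if "=" in param:
            key, value = param.split("=", 1)
            properties[key] = value

    return {
        "host": host,
        "port": int(port) if port else None,
        "http_path": properties.get("httpPath"),
        "properties": properties,
    }


# Made-up URL for illustration; copy the real one from your cluster's settings.
sample = (
    "jdbc:spark://example.cloud.databricks.com:443/default;"
    "transportMode=http;ssl=1;"
    "httpPath=sql/protocolv1/o/0/0000-000000-example;"
    "AuthMech=3;UID=token;PWD=<personal-access-token>"
)
details = parse_jdbc_url(sample)
print(details["host"], details["port"], details["http_path"])
```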

Step 4: Get StreamSets for Databricks

If you do not already have a StreamSets account, sign up for StreamSets for Databricks. You can get started for free and upgrade when you're ready; see StreamSets DataOps Platform Pricing.

Step 5: Learn how to use StreamSets to load data into Delta Lake

Start with a sample pipeline or see Loading Data into Databricks Delta Lake to learn how to build a pipeline that ingests data into Delta Lake.

Additional resources

Support
