Connect to Infoworks
Preview
This feature is in Public Preview.
Infoworks DataFoundry is an automated enterprise data operations and orchestration system that runs natively on Databricks and leverages the full power of Databricks to deliver an easy solution for data onboarding, an important first step in operationalizing your data lake. DataFoundry not only automates data ingestion, but also automates the key functionality that must accompany ingestion to establish a foundation for analytics. Data onboarding with DataFoundry automates:
Data ingestion: from all enterprise and external data sources
Data synchronization: change data capture (CDC) to keep data synchronized with the source
Data governance: cataloging, lineage, metadata management, audit, and history
Here are the steps for using Infoworks with Databricks.
Step 1: Generate a Databricks personal access token
Infoworks authenticates with Databricks using a Databricks personal access token.
Note
As a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use OAuth tokens.
If you use personal access token authentication, Databricks recommends using personal access tokens belonging to service principals instead of workspace users. To create tokens for service principals, see Manage tokens for a service principal.
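If you provision tokens for a service principal programmatically, the Databricks Token Management API can create them on the service principal's behalf. The following is a minimal sketch, assuming a workspace URL, an admin-privileged token, and a service principal application ID, all of which are placeholders you must replace:

import requests

# Placeholders: your workspace URL and a token with admin privileges.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
ADMIN_TOKEN = "<admin-token>"

# Create a token on behalf of a service principal (Token Management API).
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/token-management/on-behalf-of/tokens",
    headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
    json={
        "application_id": "<service-principal-application-id>",
        "comment": "Token for the Infoworks integration",
        "lifetime_seconds": 30 * 86400,  # 30 days
    },
)
resp.raise_for_status()
print(resp.json()["token_value"])  # supply this value to Infoworks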
Step 2: Set up a cluster to support integration needs
Infoworks writes data to an S3 bucket, and the Databricks integration cluster reads data from that location. The integration cluster therefore requires secure access to the S3 bucket.
Secure access to an S3 bucket
To access AWS resources, you can launch the Databricks integration cluster with an instance profile. The instance profile should have access to the staging S3 bucket and the target S3 bucket where you want to write the Delta tables. To create an instance profile and configure the integration cluster to use the role, follow the instructions in Tutorial: Configure S3 access with an instance profile.
As an alternative, you can use IAM credential passthrough, which enables user-specific access to S3 data from a shared cluster.
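Before connecting Infoworks, you can confirm from a notebook attached to the integration cluster that both buckets are reachable. This is a minimal sketch; the bucket names are placeholders for your staging and target buckets:

# Placeholders: substitute your staging and target bucket names.
staging_bucket = "s3://<staging-bucket>/"
target_bucket = "s3://<target-bucket>/"

# Listing each bucket verifies that the instance profile (or passthrough
# credentials) grants the cluster access to that location.
for bucket in (staging_bucket, target_bucket):
    try:
        display(dbutils.fs.ls(bucket))
    except Exception as e:
        print(f"Cannot access {bucket}: {e}")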
Specify the cluster configuration
Set Cluster Mode to Standard.
Set Databricks Runtime Version to the Databricks Runtime version you want the cluster to use.
Enable optimized writes and auto compaction by adding the following properties to your Spark configuration:
spark.databricks.delta.optimizeWrite.enabled true
spark.databricks.delta.autoCompact.enabled true
Configure your cluster depending on your integration and scaling needs.
For cluster configuration details, see Compute configuration reference.
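If you prefer to create the integration cluster programmatically, the settings above can be combined into one Clusters API call. This is a minimal sketch, assuming placeholder values for the workspace URL, token, runtime version, node type, and instance profile ARN:

import requests

# Placeholders: substitute your workspace URL and access token.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "infoworks-integration",
    "spark_version": "<databricks-runtime-version>",
    "node_type_id": "<node-type>",
    "num_workers": 2,  # size to your integration and scaling needs
    # Optimized writes and auto compaction, as configured above.
    "spark_conf": {
        "spark.databricks.delta.optimizeWrite.enabled": "true",
        "spark.databricks.delta.autoCompact.enabled": "true",
    },
    # Instance profile with access to the staging and target S3 buckets.
    "aws_attributes": {"instance_profile_arn": "<instance-profile-arn>"},
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])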
Step 3: Obtain JDBC and ODBC connection details to connect to a cluster
To connect a Databricks cluster to Infoworks, you need the following JDBC/ODBC connection properties:
JDBC URL
HTTP Path
See Get connection details for a Databricks compute resource for the steps to obtain the JDBC URL and HTTP path.
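To verify the connection details end to end, you can open a test connection with the Databricks SQL Connector for Python (pip install databricks-sql-connector). This is a minimal sketch; the server hostname and HTTP path are placeholders taken from your cluster's JDBC/ODBC connection details, and the token is the one generated in Step 1:

from databricks import sql

# Placeholders: use the server hostname and HTTP path from the cluster's
# JDBC/ODBC connection details, plus your personal access token.
with sql.connect(
    server_hostname="<your-workspace>.cloud.databricks.com",
    http_path="<http-path>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchall())  # [(1,)] confirms connectivity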
Step 4: Get Infoworks for Databricks
Go to Infoworks to learn more and get a demo.