Create a Google Analytics Raw Data ingestion pipeline
The Google Analytics Raw Data connector is in Public Preview.
This article describes how to create a Google Analytics Raw Data ingestion pipeline using Databricks Lakeflow Connect and Google BigQuery. You can create the pipeline using the Databricks UI, a Databricks notebook, or the Databricks CLI.
Before you begin
To create an ingestion pipeline, you must meet the following requirements:
- Your workspace is enabled for Unity Catalog.
- Serverless compute is enabled for your workspace. See Enable serverless compute.
- If you plan to create a connection: You have `CREATE CONNECTION` privileges on the metastore. If you plan to use an existing connection: You have `USE CONNECTION` privileges or `ALL PRIVILEGES` on the connection object.
- You have `USE CATALOG` privileges on the target catalog.
- You have `USE SCHEMA` and `CREATE TABLE` privileges on an existing schema, or `CREATE SCHEMA` privileges on the target catalog.
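If you need to grant these privileges, the following is a minimal sketch that runs the grants from a notebook. The group `data-engineers`, connection `ga_bigquery_conn`, catalog `main`, and schema `ga_raw` are placeholder names, not objects defined in this article; substitute your own.

```python
# Placeholder principal and object names; replace them with your own.

# Metastore-level privilege for creating new connections.
spark.sql("GRANT CREATE CONNECTION ON METASTORE TO `data-engineers`")

# Or, to reuse an existing connection instead of creating one.
spark.sql("GRANT USE CONNECTION ON CONNECTION ga_bigquery_conn TO `data-engineers`")

# Privileges on the destination catalog and schema.
spark.sql("GRANT USE CATALOG, CREATE SCHEMA ON CATALOG main TO `data-engineers`")
spark.sql("GRANT USE SCHEMA, CREATE TABLE ON SCHEMA main.ga_raw TO `data-engineers`")
```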
To ingest from GA4 using BigQuery, see Set up Google Analytics 4 and Google BigQuery for Databricks ingestion.
Configure networking
If you have serverless egress control enabled, allowlist the following URLs. Otherwise, skip this step. See Managing network policies for serverless egress control.
- bigquery.googleapis.com
- oauth2.googleapis.com
- bigquerystorage.googleapis.com
- googleapis.com
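As an optional sanity check (not part of the official setup), you can verify that these endpoints are reachable from serverless compute after updating your network policy. The sketch below uses only the `requests` library; any HTTP response, even an error status, means the host is reachable, while a timeout suggests it is still blocked.

```python
import requests

# Hosts taken from the allowlist above. A bare domain such as googleapis.com may not
# serve content itself, so treat any HTTP response (including 4xx) as "reachable".
hosts = [
    "https://bigquery.googleapis.com",
    "https://oauth2.googleapis.com",
    "https://bigquerystorage.googleapis.com",
    "https://googleapis.com",
]

for host in hosts:
    try:
        status = requests.get(host, timeout=10).status_code
        print(f"{host}: reachable (HTTP {status})")
    except requests.exceptions.RequestException as err:
        print(f"{host}: not reachable ({err})")
```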
Create the ingestion pipeline
Permissions required: `USE CONNECTION` or `ALL PRIVILEGES` on a connection.
This step describes how to create the ingestion pipeline. Each ingested table is written to a streaming table with the same name.
Use one of the following methods: the Databricks UI, a Databricks notebook, or the Databricks CLI.

Databricks UI
- In the sidebar of the Databricks workspace, click Data Ingestion.
- On the Add data page, under Databricks connectors, click Google Analytics 4. The ingestion wizard opens.
- On the Ingestion pipeline page of the wizard, enter a unique name for the pipeline.
- In the Destination catalog drop-down menu, select a catalog. Ingested data and event logs will be written to this catalog. You'll select a destination schema later.
- Select the Unity Catalog connection that stores the credentials required to access the source data. If there are no existing connections to the source, click Create connection and enter the authentication details you obtained in Set up Google Analytics 4 and Google BigQuery for Databricks ingestion. You must have `CREATE CONNECTION` privileges on the metastore.
- Click Create pipeline and continue.
- On the Source page, select the tables to ingest into Databricks, and then click Next.
- On the Destination page, select the Unity Catalog catalog and schema to write to. If you don't want to use an existing schema, click Create schema. You must have `USE CATALOG` and `CREATE SCHEMA` privileges on the parent catalog.
- Click Save pipeline and continue.
- (Optional) On the Settings page, click Create schedule. Set the frequency to refresh the destination tables.
- (Optional) Set email notifications for pipeline operation success or failure.
- Click Save and run pipeline.
Databricks notebook

- Generate a personal access token and copy the token so you can paste it into a notebook later. See Databricks personal access tokens for workspace users.
- Import the following notebook to your workspace: Create a Google Analytics raw data ingestion pipeline.
- Modify the following values in the notebook (a standalone sketch of the API call these values feed into appears after these steps):

  Cell 1:

  - `api_token`: The personal access token you generated.

  Cell 3:

  - `name`: A name for the pipeline.
  - `connection_name`: The name of the Unity Catalog connection you created in Catalog Explorer (Catalog > External data > Connections). If you don't have an existing connection to the source, you can create one. You must have the `CREATE CONNECTION` privilege on the metastore.
  - `source_catalog`: A Google Cloud Platform (GCP) project ID. If the source catalog is not specified, the connector assumes that the GCP project to ingest from is the one mentioned in the service account.
  - `source_schema`: A Google Analytics property name in the format `analytics_XXXXXXXX`.
  - `source_table`: The name of the source table: `events`, `events_intraday`, `users`, or `pseudonymous_users`.
  - `destination_catalog`: A name for the destination catalog that will contain the ingested data.
  - `destination_schema`: A name for the destination schema that will contain the ingested data.
  - `scd_type`: The SCD method to use: `SCD_TYPE_1` or `SCD_TYPE_2`. See What is SCD type 1 vs. type 2?.
- Click Run all.
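For reference, the notebook's cells boil down to a call to the Databricks pipelines REST API. The sketch below shows roughly what that call looks like when built from the values above; the workspace URL, object names, and the exact payload shape (in particular the `ingestion_definition` fields) are assumptions here, so defer to the imported notebook and the Lakeflow Connect API reference for the authoritative format.

```python
import requests

# Assumed values; replace with your workspace URL, token, and object names.
workspace_url = "https://<your-workspace>.cloud.databricks.com"
api_token = "<personal-access-token>"  # the token generated in the first step

pipeline_spec = {
    "name": "ga4_raw_ingestion",
    "ingestion_definition": {
        "connection_name": "ga_bigquery_conn",
        "objects": [
            {
                "table": {
                    "source_catalog": "my-gcp-project-id",
                    "source_schema": "analytics_XXXXXXXX",
                    "source_table": "events",
                    "destination_catalog": "main",
                    "destination_schema": "ga_raw",
                    "table_configuration": {"scd_type": "SCD_TYPE_1"},
                }
            }
        ],
    },
}

# Create the pipeline; the response includes the new pipeline's ID.
response = requests.post(
    f"{workspace_url}/api/2.0/pipelines",
    headers={"Authorization": f"Bearer {api_token}"},
    json=pipeline_spec,
)
response.raise_for_status()
print(response.json())
```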
Databricks CLI

To create the pipeline:

```bash
databricks pipelines create --json "<pipeline definition or json file path>"
```
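The `--json` flag accepts an inline pipeline definition or a path to a JSON file. The snippet below sketches one way to generate such a file; it reuses the same (assumed) `ingestion_definition` shape as the notebook sketch above, with placeholder names.

```python
import json

# Placeholder definition; swap in your own pipeline, connection, source, and
# destination names. The field layout mirrors the notebook sketch above.
definition = {
    "name": "ga4_raw_ingestion",
    "ingestion_definition": {
        "connection_name": "ga_bigquery_conn",
        "objects": [
            {
                "table": {
                    "source_schema": "analytics_XXXXXXXX",
                    "source_table": "events",
                    "destination_catalog": "main",
                    "destination_schema": "ga_raw",
                }
            }
        ],
    },
}

with open("pipeline.json", "w") as f:
    json.dump(definition, f, indent=2)

# Then pass the file path to the CLI:
#   databricks pipelines create --json "<path to pipeline.json>"
```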
To edit the pipeline:

```bash
databricks pipelines update --json "<pipeline definition or json file path>"
```

To get the pipeline definition:

```bash
databricks pipelines get "<pipeline-id>"
```

To delete the pipeline:

```bash
databricks pipelines delete "<pipeline-id>"
```

For more information, run:

```bash
databricks pipelines --help
databricks pipelines <create|update|get|delete|...> --help
```
Update your pipeline schedule and notifications
- After the pipeline has been created, return to the Databricks workspace, and then click Pipelines. The new pipeline appears in the pipeline list.
- To view the pipeline details, click the pipeline name.
- On the pipeline details page, you can schedule the pipeline by clicking Schedule.
- To set notifications on the pipeline, click Settings, and then add a notification.
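If you prefer to manage the schedule and notifications programmatically rather than through the UI, one common pattern is to wrap the pipeline in a job with a cron schedule and email notifications via the Jobs API. This is a rough sketch under the assumption that a job-based schedule fits your setup; the workspace URL, pipeline ID, and names are illustrative placeholders.

```python
import requests

# Assumed values; replace with your workspace URL and token.
workspace_url = "https://<your-workspace>.cloud.databricks.com"
api_token = "<personal-access-token>"

job_spec = {
    "name": "ga4_raw_ingestion_refresh",
    "tasks": [
        {
            "task_key": "refresh_pipeline",
            # Replace with the pipeline ID shown on the pipeline details page.
            "pipeline_task": {"pipeline_id": "<pipeline-id>"},
        }
    ],
    # Refresh daily at 06:00 UTC.
    "schedule": {
        "quartz_cron_expression": "0 0 6 * * ?",
        "timezone_id": "UTC",
    },
    "email_notifications": {"on_failure": ["you@example.com"]},
}

response = requests.post(
    f"{workspace_url}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {api_token}"},
    json=job_spec,
)
response.raise_for_status()
print(response.json())  # returns the new job_id
```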