Google Analytics Raw Data connector concepts
The Google Analytics Raw Data connector is in Public Preview.
The Google Analytics Raw Data connector allows you to ingest raw, event-level data from Google Analytics 4 (GA4) using Databricks Lakeflow Connect and Google BigQuery.
How does GA4 ingestion work?
First, you must export your GA4 data to BigQuery using Google’s provided APIs or UIs. Then, Databricks consumes the data from BigQuery using the following APIs:
- The BigQuery API for metadata operations (for example, to list tables and schemas)
- The BigQuery Storage API for data ingestion
- The Cloud Resource Manager API
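For illustration, the following sketch uses the google-cloud-bigquery Python client to perform the same kind of metadata operations the connector relies on: listing the export tables in a GA4 export dataset and reading a table's schema. The project ID and dataset name are placeholders (GA4 exports land in a dataset named analytics_<property_id>); this is not the connector's implementation.

```python
# A sketch of the kind of metadata operations described above, using the
# google-cloud-bigquery client library (pip install google-cloud-bigquery).
# The project ID and dataset name are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")
dataset_id = "my-gcp-project.analytics_123456789"  # GA4 exports to analytics_<property_id>

# List the date-partitioned export tables (events_YYYYMMDD, users_YYYYMMDD, ...).
for table in client.list_tables(dataset_id):
    print(table.table_id)

# Read a table's schema, another metadata-only operation.
schema = client.get_table(f"{dataset_id}.events_20241024").schema
print([field.name for field in schema])
```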
Connector data model
The GA4 connector can ingest the following tables from a given GA4 property:
- `events`
- `events_intraday`
- `users`
- `pseudonymous_users`
For each day that data arrives in GA4, a date-partitioned table is automatically created in BigQuery. The BigQuery table name has the format `<table_name>_YYYYMMDD` (for example, `events_20241024`).
During each Lakeflow Connect pipeline update, the connector automatically ingests any new tables since the last update. It also ingests any new rows in existing tables for up to 72 hours.
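As a rough illustration of that bookkeeping (not the connector's actual logic), the date suffix in each table name is enough to tell which tables are new since the last update. The inputs below are placeholders.

```python
# A rough illustration of the bookkeeping described above (not the connector's
# implementation): the _YYYYMMDD suffix on each export table is enough to tell
# which tables are new since the last pipeline update. Inputs are placeholders.
from datetime import date, datetime

table_ids = [
    "events_20241022",
    "events_20241023",
    "events_20241024",
    "events_intraday_20241024",
]
last_ingested = date(2024, 10, 22)  # last date suffix a previous update ingested

def suffix_date(table_id: str) -> date:
    """Parse the trailing _YYYYMMDD suffix of a GA4 export table name."""
    return datetime.strptime(table_id.rsplit("_", 1)[-1], "%Y%m%d").date()

new_tables = [t for t in table_ids if suffix_date(t) > last_ingested]
print(new_tables)  # ['events_20241023', 'events_20241024', 'events_intraday_20241024']
```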
Connector basics
- On the initial run of the pipeline, the connector ingests all of the data that you’ve exported to BigQuery for the tables that you’ve selected.
- On subsequent pipeline runs, the connector ingests newly inserted rows, with the caveats outlined in this article.
- Updates and deletes are not ingested.
- The initial load fetches the data for all dates that are present in your GA4/BigQuery project.
- The connector assumes that each row is unique. Databricks can't guarantee correct behavior if there are unexpected duplicates (see the spot-check sketch after this list).
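Because the connector assumes row uniqueness, you may want to spot-check an ingested table for duplicates. The sketch below uses a hypothetical destination table name and assumes the table retains GA4 export columns such as event_timestamp, event_name, and user_pseudo_id; adjust both to match your pipeline.

```python
# Spot-check an ingested table for unexpected duplicates. The table name and the
# identifying columns below are assumptions; adjust them to match your pipeline.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided as `spark` on Databricks

events = spark.table("main.ga4.events")  # hypothetical destination table
key_cols = ["event_timestamp", "event_name", "user_pseudo_id"]  # assumed identifying columns

duplicates = events.groupBy(*key_cols).count().filter(F.col("count") > 1)
print(f"Key combinations with more than one row: {duplicates.count()}")
```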
Update windows and schedules
GA4 can continue to update tables for up to 72 hours after they’re created. Therefore, Databricks tracks and ingests updates on those tables for 72 hours. The connector doesn’t automatically ingest updates to the tables after the 72-hour update window (for example, if GA4 reprocesses historical data).
You should run your Lakeflow Connect pipeline at least every 72 hours, but Databricks recommends running the pipeline daily. Syncing less frequently increases the risk that the connector will need to refetch data.
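If you want a simple guardrail around that cadence, a freshness check along the following lines can flag when the last successful update is approaching the 72-hour window. The timestamp is a placeholder; source it from your own pipeline monitoring.

```python
# A simple freshness check: warn when the last successful pipeline update is
# approaching or past the 72-hour window. The timestamp is a placeholder; source
# it from your own pipeline monitoring.
from datetime import datetime, timedelta, timezone

last_successful_update = datetime(2024, 10, 24, 6, 0, tzinfo=timezone.utc)  # placeholder
age = datetime.now(timezone.utc) - last_successful_update

if age > timedelta(hours=72):
    print("WARNING: past the 72-hour window; some data may need to be refetched.")
elif age > timedelta(hours=24):
    print("Consider running the pipeline; Databricks recommends daily runs.")
else:
    print("Within the recommended daily cadence.")
```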
Databricks also recommends maintaining BigQuery's default time travel window of 7 days. This can help with ingestion efficiency.
Table-level data models and other key information
events and events_intraday tables
For the `events` table and the `events_intraday` table, one row in Databricks corresponds to one row in BigQuery.

For the `events_intraday` table, there is no guarantee that the data will exist for a particular date after the data for the same date is available in the `events` table. This is because the `events_intraday` table is only intended for interim use until the `events` table is ready for that day.
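To make that relationship concrete, the following sketch (an illustration only, querying BigQuery directly with placeholder project and dataset IDs) prefers the finalized daily table and falls back to the intraday table only while the daily table doesn't exist yet.

```python
# An illustration of the relationship above: prefer the finalized daily table and
# fall back to the intraday table only while the daily table doesn't exist yet.
# Project and dataset IDs are placeholders.
from google.cloud import bigquery
from google.api_core.exceptions import NotFound

client = bigquery.Client(project="my-gcp-project")
dataset_id = "my-gcp-project.analytics_123456789"

def daily_or_intraday(day: str) -> str:
    """Return events_<day> if it has been created, otherwise events_intraday_<day>."""
    try:
        client.get_table(f"{dataset_id}.events_{day}")
        return f"events_{day}"
    except NotFound:
        return f"events_intraday_{day}"

print(daily_or_intraday("20241024"))
```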
users table
To ingest from the `users` table, the connector relies on the `user_id` as the primary key and the `last_updated_date` as the cursor key. As a result, it only ingests one row per user ID from each `users` table: the entry with the largest `last_updated_date`.
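The effect is roughly the "latest row per user_id" query sketched below. The table name is a placeholder for a raw copy of the users export, and the connector's internal implementation may differ.

```python
# Roughly the "latest row per user_id" behavior described above, expressed as a
# PySpark query. The table name is a placeholder for a raw copy of the users
# export; the connector's internal implementation may differ.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided as `spark` on Databricks

raw_users = spark.table("main.ga4_staging.users_raw")  # hypothetical raw table

latest_per_user = (
    raw_users.withColumn(
        "rn",
        F.row_number().over(
            Window.partitionBy("user_id").orderBy(F.col("last_updated_date").desc())
        ),
    )
    .filter(F.col("rn") == 1)
    .drop("rn")
)
latest_per_user.show(10)
```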
To preserve more than one row per user ID in the destination table, set the SCD mode to type 2 in the table configuration.
pseudonymous_users table
To ingest from the `pseudonymous_users` table, the connector relies on the `pseudo_user_id` and the `stream_id` as the primary keys. It uses the `last_updated_date` as the cursor key. As a result, it only ingests one row per combination of pseudo user ID and stream ID from each `pseudonymous_users` table: the entry with the largest `last_updated_date`.
To preserve more than one row per combination of pseudo user ID and stream ID in the destination table, set the SCD mode to type 2 in the table configuration.
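With SCD type 2 enabled, the destination keeps multiple rows per combination of pseudo_user_id and stream_id instead of only the latest one. The sketch below simply counts how many versions exist per key; the destination table name is a placeholder.

```python
# With SCD type 2, the destination keeps multiple rows per (pseudo_user_id,
# stream_id) instead of only the latest one. This sketch counts the versions kept
# per key; the destination table name is a placeholder.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided as `spark` on Databricks

pseudo_users = spark.table("main.ga4.pseudonymous_users")  # hypothetical destination table

versions_per_key = (
    pseudo_users.groupBy("pseudo_user_id", "stream_id")
    .count()
    .orderBy(F.col("count").desc())
)
versions_per_key.show(10)
```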