Google Analytics Raw Data connector concepts
The Google Analytics Raw Data connector is in Public Preview.
The Google Analytics Raw Data connector allows you to ingest raw, event-level data from Google Analytics 4 (GA4) using Databricks Lakeflow Connect and Google BigQuery.
How does GA4 ingestion work?
First, you must export your GA4 data to BigQuery using Google’s provided APIs or UIs. Then, Databricks consumes the data from BigQuery using the following APIs:
- The BigQuery API for metadata operations (for example, to list tables and schemas)
- The BigQuery Storage API for data ingestion
- The Cloud Resource Manager API
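For illustration, the following sketch uses the google-cloud-bigquery Python client to perform the same kind of metadata operations the connector relies on: listing the export tables in a GA4 export dataset and reading a table's schema. The project ID and dataset name are placeholders (GA4 exports land in a dataset named analytics_<property_id>); this is not the connector's implementation.

```python
# A sketch of the kind of metadata operations described above, using the
# google-cloud-bigquery client library (pip install google-cloud-bigquery).
# The project ID and dataset name are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")
dataset_id = "my-gcp-project.analytics_123456789"  # GA4 exports to analytics_<property_id>

# List the date-partitioned export tables (events_YYYYMMDD, users_YYYYMMDD, ...).
for table in client.list_tables(dataset_id):
    print(table.table_id)

# Read a table's schema, another metadata-only operation.
schema = client.get_table(f"{dataset_id}.events_20241024").schema
print([field.name for field in schema])
```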
Connector data model
The GA4 connector can ingest the following tables from a given GA4 property:
- `events`
- `events_intraday`
- `users`
- `pseudonymous_users`
For each day that data arrives in GA4, a date-partitioned table is automatically created in BigQuery. The BigQuery table name has the format `<table_name>_YYYYMMDD` (for example, `events_20241024`).
During each Lakeflow Connect pipeline update, the connector automatically ingests any new tables since the last update. It also ingests any new rows in existing tables for up to 72 hours.
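As a rough illustration of that bookkeeping (not the connector's actual logic), the date suffix in each table name is enough to tell which tables are new since the last update. The inputs below are placeholders.

```python
# A rough illustration of the bookkeeping described above (not the connector's
# implementation): the _YYYYMMDD suffix on each export table is enough to tell
# which tables are new since the last pipeline update. Inputs are placeholders.
from datetime import date, datetime

table_ids = [
    "events_20241022",
    "events_20241023",
    "events_20241024",
    "events_intraday_20241024",
]
last_ingested = date(2024, 10, 22)  # last date suffix a previous update ingested

def suffix_date(table_id: str) -> date:
    """Parse the trailing _YYYYMMDD suffix of a GA4 export table name."""
    return datetime.strptime(table_id.rsplit("_", 1)[-1], "%Y%m%d").date()

new_tables = [t for t in table_ids if suffix_date(t) > last_ingested]
print(new_tables)  # ['events_20241023', 'events_20241024', 'events_intraday_20241024']
```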
Connector basics
- On the initial run of the pipeline, the connector ingests all of the data that you’ve exported to BigQuery for the tables that you’ve selected.
- On subsequent pipeline runs, the connector ingests newly inserted rows, with the caveats outlined in this article.
- Updates and deletes are not ingested.
- The initial load fetches the data for all dates that are present in your GA4/BigQuery project.
- The connector assumes that each row is unique. Databricks can't guarantee correct behavior if there are unexpected duplicates (see the spot-check sketch after this list).
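Because the connector assumes row uniqueness, you may want to spot-check an ingested table for duplicates. The sketch below uses a hypothetical destination table name and assumes the table retains GA4 export columns such as event_timestamp, event_name, and user_pseudo_id; adjust both to match your pipeline.

```python
# Spot-check an ingested table for unexpected duplicates. The table name and the
# identifying columns below are assumptions; adjust them to match your pipeline.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided as `spark` on Databricks

events = spark.table("main.ga4.events")  # hypothetical destination table
key_cols = ["event_timestamp", "event_name", "user_pseudo_id"]  # assumed identifying columns

duplicates = events.groupBy(*key_cols).count().filter(F.col("count") > 1)
print(f"Key combinations with more than one row: {duplicates.count()}")
```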
Update windows and schedules
GA4 can continue to update tables for up to 72 hours after they’re created. Therefore, Databricks tracks and ingests updates on those tables for 72 hours. The connector doesn’t automatically ingest updates to the tables after the 72-hour update window (for example, if GA4 reprocesses historical data).
You should run your Lakeflow Connect pipeline at least every 72 hours, but Databricks recommends running the pipeline daily. Syncing less frequently increases the risk that the connector will need to refetch data.
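If you want a simple guardrail around that cadence, a freshness check along the following lines can flag when the last successful update is approaching the 72-hour window. The timestamp is a placeholder; source it from your own pipeline monitoring.

```python
# A simple freshness check: warn when the last successful pipeline update is
# approaching or past the 72-hour window. The timestamp is a placeholder; source
# it from your own pipeline monitoring.
from datetime import datetime, timedelta, timezone

last_successful_update = datetime(2024, 10, 24, 6, 0, tzinfo=timezone.utc)  # placeholder
age = datetime.now(timezone.utc) - last_successful_update

if age > timedelta(hours=72):
    print("WARNING: past the 72-hour window; some data may need to be refetched.")
elif age > timedelta(hours=24):
    print("Consider running the pipeline; Databricks recommends daily runs.")
else:
    print("Within the recommended daily cadence.")
```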
Databricks also recommends maintaining BigQuery's default time travel window of 7 days. This can help with ingestion efficiency.
Table-level data models and other key information
events and events_intraday tables
For the `events` table and the `events_intraday` table, one row in Databricks corresponds to one row in BigQuery.

For the `events_intraday` table, there is no guarantee that the data will exist for a particular date after the data for the same date is available in the `events` table. This is because the `events_intraday` table is only intended for interim use until the `events` table is ready for that day.
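To make that relationship concrete, the following sketch (an illustration only, querying BigQuery directly with placeholder project and dataset IDs) prefers the finalized daily table and falls back to the intraday table only while the daily table doesn't exist yet.

```python
# An illustration of the relationship above: prefer the finalized daily table and
# fall back to the intraday table only while the daily table doesn't exist yet.
# Project and dataset IDs are placeholders.
from google.cloud import bigquery
from google.api_core.exceptions import NotFound

client = bigquery.Client(project="my-gcp-project")
dataset_id = "my-gcp-project.analytics_123456789"

def daily_or_intraday(day: str) -> str:
    """Return events_<day> if it has been created, otherwise events_intraday_<day>."""
    try:
        client.get_table(f"{dataset_id}.events_{day}")
        return f"events_{day}"
    except NotFound:
        return f"events_intraday_{day}"

print(daily_or_intraday("20241024"))
```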
users table
To ingest from the `users` table, the connector relies on the `user_id` as the primary key and the `last_updated_date` as the cursor key. As a result, it only ingests one row per user ID from each `users` table: the entry with the largest `last_updated_date`.
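The effect is roughly the "latest row per user_id" query sketched below. The table name is a placeholder for a raw copy of the users export, and the connector's internal implementation may differ.

```python
# Roughly the "latest row per user_id" behavior described above, expressed as a
# PySpark query. The table name is a placeholder for a raw copy of the users
# export; the connector's internal implementation may differ.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided as `spark` on Databricks

raw_users = spark.table("main.ga4_staging.users_raw")  # hypothetical raw table

latest_per_user = (
    raw_users.withColumn(
        "rn",
        F.row_number().over(
            Window.partitionBy("user_id").orderBy(F.col("last_updated_date").desc())
        ),
    )
    .filter(F.col("rn") == 1)
    .drop("rn")
)
latest_per_user.show(10)
```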
To preserve more than one row per user ID in the destination table, set the SCD mode to type 2 in the table configuration.
pseudonymous_users table
To ingest from the `pseudonymous_users` table, the connector relies on the `pseudo_user_id` and the `stream_id` as the primary keys. It uses the `last_updated_date` as the cursor key. As a result, it only ingests one row per combination of pseudo user ID and stream ID from each `pseudonymous_users` table: the entry with the largest `last_updated_date`.
To preserve more than one row per combination of pseudo user ID and stream ID in the destination table, set the SCD mode to type 2 in the table configuration.
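With SCD type 2 enabled, the destination keeps multiple rows per combination of pseudo_user_id and stream_id instead of only the latest one. The sketch below simply counts how many versions exist per key; the destination table name is a placeholder.

```python
# With SCD type 2, the destination keeps multiple rows per (pseudo_user_id,
# stream_id) instead of only the latest one. This sketch counts the versions kept
# per key; the destination table name is a placeholder.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided as `spark` on Databricks

pseudo_users = spark.table("main.ga4.pseudonymous_users")  # hypothetical destination table

versions_per_key = (
    pseudo_users.groupBy("pseudo_user_id", "stream_id")
    .count()
    .orderBy(F.col("count").desc())
)
versions_per_key.show(10)
```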