Select rows to ingest
This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.
Applies to: API-based pipeline authoring
Row filtering allows you to ingest only the data you need by applying conditions similar to a SQL WHERE clause. This improves performance (especially for initial loads with historical data) and minimizes data duplication (especially in development environments).
Supported connectors
- Google Analytics
- Salesforce
- ServiceNow
How row filtering works
Row filtering acts like a WHERE filter in SQL. You can compare values in the source against integers, booleans, strings, and other data types. You can also use complex combinations of clauses to pull only the data you need.
Row filtering applies during both the initial load and subsequent incremental updates.
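As a sketch of these WHERE-like semantics, the following Python snippet (with hypothetical column names) applies the same kind of condition to in-memory rows. In a real pipeline the filtering happens at the source connector, not in your code:

```python
# Rows with hypothetical columns, standing in for source records.
rows = [
    {"u_name": "johnsmith", "u_age": 40, "u_active": True},
    {"u_name": "janedoe", "u_age": 35, "u_active": False},
]

# Equivalent of the filter "u_age = 40 AND u_active = TRUE":
# only rows satisfying every condition are kept.
matched = [r for r in rows if r["u_age"] == 40 and r["u_active"]]
print([r["u_name"] for r in matched])
```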
Limitations
Row filtering has the following limitations:
- Salesforce: Row filtering is supported on only two columns: the primary key (`Id`, if available) and the cursor column. The connector selects the cursor column from the following list, in order of preference: `SystemModstamp`, `LastModifiedDate`, `CreatedDate`, and `LoginTime`.
- ServiceNow:
  - Only the `AND` operator is supported. The `OR` operator is not currently available. For example, `u_age = 40 AND u_active = TRUE` works, but `u_age = 40 OR u_active = TRUE` does not.
  - Timestamps in filters must use the format `YYYY-MM-DD HH:mm:SS` (for example, `2004-03-02 17:14:59`).
- Row or query updates: If a row matches the filter on the initial load but the row or the query is later updated so that it no longer matches, the connector does not delete the row. Conversely, if you update the query so that a previously unmatched row now matches, the connector does not ingest that row on the next pipeline update; ingesting it requires a full refresh.
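The row-update behavior above can be sketched as a toy two-load model (not the connector's actual implementation): matching rows are added or updated, and rows that stop matching are never deleted from the destination.

```python
def run_update(source_rows, row_filter, ingested):
    """Toy model of one pipeline update: ingest matching rows only."""
    for key, row in source_rows.items():
        if row_filter(row):
            ingested[key] = row  # new or updated match is ingested
        # no else branch: rows that stop matching are NOT deleted
    return ingested

row_filter = lambda r: r["is_active_user"]

# Initial load: only row 1 matches the filter.
ingested = run_update(
    {1: {"is_active_user": True}, 2: {"is_active_user": False}},
    row_filter, {})

# Next update: row 1 no longer matches, row 2 now does.
ingested = run_update(
    {1: {"is_active_user": False}, 2: {"is_active_user": True}},
    row_filter, ingested)

# Row 1 remains in the destination (not deleted); row 2 is now ingested.
print(sorted(ingested))
```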
Configure row filtering
To configure a pipeline with row filtering, add the `row_filter` configuration to your pipeline specification. For example:
pipeline_spec = """
{
  "name": "...",
  "ingestion_definition": {
    "connection_name": "...",
    "objects": [
      {
        "table": {
          "source_schema": "...",
          "source_table": "...",
          "destination_catalog": "...",
          "destination_schema": "...",
          "destination_table": "...",
          "table_configuration": {
            "row_filter": "details go here; see examples below"
          }
        }
      }
    ]
  },
  "channel": "PREVIEW"
}
"""

create_pipeline(pipeline_spec)
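Because `row_filter` is embedded in a JSON string, a malformed spec only fails at pipeline creation. A minimal sketch (with hypothetical names and a stand-in filter) that validates the spec locally before submitting it:

```python
import json

# Hypothetical spec with placeholder values.
pipeline_spec = """
{
  "name": "demo-pipeline",
  "ingestion_definition": {
    "connection_name": "demo-connection",
    "objects": [
      {
        "table": {
          "source_schema": "src",
          "source_table": "users",
          "destination_catalog": "main",
          "destination_schema": "ingest",
          "destination_table": "users",
          "table_configuration": {"row_filter": "u_active = TRUE"}
        }
      }
    ]
  },
  "channel": "PREVIEW"
}
"""

spec = json.loads(pipeline_spec)  # raises ValueError if the JSON is malformed
filters = [
    obj["table"]["table_configuration"]["row_filter"]
    for obj in spec["ingestion_definition"]["objects"]
]
print(filters)
```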
Examples
Salesforce
Ingest data after a certain system timestamp:
"row_filter": "SystemModstamp > '2025-06-10T23:40:11.000-07:00'"
Ingest a specific row:
"row_filter": "Id = 'a00Qy00000vps2NIAQ'"
Google Analytics
Ingest data after a certain event timestamp:
"row_filter": "event_timestamp > 1712224270703246"
Ingest data for active users:
"row_filter": "is_active_user = TRUE"
Ingest data for non-web platforms:
"row_filter": "platform != 'WEB'"
Ingest data with multiple conditions:
"row_filter": "event_timestamp > 1712224270703246 AND (platform != 'WEB' OR is_active_user = FALSE)"
ServiceNow
Ingest data after a certain event timestamp:
"row_filter": "sys_updated_on > 2004-03-02 17:14:59"
Ingest data for active users:
"row_filter": "u_active = TRUE"
Ingest data for specific users:
"row_filter": "u_name = 'johnsmith'"
Ingest data with multiple conditions:
"row_filter": "u_active = TRUE AND u_name = 'johnsmith'"
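Filters like the ones above are plain strings, so quoting bugs are easy to introduce. The following is a hypothetical helper (not part of any Databricks API) that builds an AND-joined filter and escapes single quotes in string values:

```python
def build_row_filter(conditions):
    """Build a row_filter string from (column, operator, value) tuples."""
    parts = []
    for column, op, value in conditions:
        if isinstance(value, bool):
            literal = "TRUE" if value else "FALSE"
        elif isinstance(value, str):
            # Escape embedded single quotes to keep the filter well-formed.
            literal = "'" + value.replace("'", "''") + "'"
        else:
            literal = str(value)
        parts.append(f"{column} {op} {literal}")
    # ServiceNow supports AND only, so conditions are joined with AND.
    return " AND ".join(parts)

result = build_row_filter([("u_active", "=", True), ("u_name", "=", "johnsmith")])
print(result)
```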
Supported operators
The following table shows which operators are supported for row filtering:
| Operator | Supported |
|---|---|
| Yes |
| Salesforce and Google Analytics only |
| Yes |
| Yes |
| No |
| No |
| Yes |
| Yes |
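Given the ServiceNow limitation above (only `AND` is supported), a small heuristic check can catch unsupported `OR` usage before you create the pipeline. This is a sketch, and it can false-positive on string literals containing the word OR:

```python
import re

def uses_or(row_filter):
    """Heuristic: True if the filter contains a bare OR keyword."""
    return re.search(r"\bOR\b", row_filter, flags=re.IGNORECASE) is not None

print(uses_or("u_age = 40 OR u_active = TRUE"))   # True: rejected for ServiceNow
print(uses_or("u_age = 40 AND u_active = TRUE"))  # False: AND-only filter is fine
```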
FAQ
Find answers to frequently asked questions about row filtering.
What happens if a row fails to match the row filter on the initial load but is later updated to match it on a subsequent load?
The row is ingested during the next pipeline update. This does not require a refresh.
What happens if a row matches the row filter on the initial load but is later updated to no longer match it?
The row is not deleted during the next pipeline update.
What happens if I update the query and a previously uningested row now matches?
The row is not ingested during the next pipeline update. To ingest rows that newly match after a query change, trigger a full refresh.
What happens if I update the query and a previously ingested row no longer matches?
The row is not deleted during the next pipeline update.