Select rows to ingest

Beta

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.

Applies to: API-based pipeline authoring

Row filtering allows you to ingest only the data you need by applying conditions similar to a SQL WHERE clause. This improves performance (especially for initial loads with historical data) and minimizes data duplication (especially in development environments).

Supported connectors

  • Google Analytics
  • Salesforce
  • ServiceNow

How row filtering works

Row filtering acts like a WHERE filter in SQL. You can compare values in the source against integers, booleans, strings, and other data types. You can also use complex combinations of clauses to pull only the data you need.

Row filtering applies during both the initial load and subsequent incremental updates.
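
For example, the following filter combines an integer comparison with a boolean check. The column names u_age and u_active are illustrative, not part of any specific source schema; see the per-connector limitations below:

JSON
"row_filter": "u_age >= 40 AND u_active = TRUE"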

Limitations

Row filtering has the following limitations:

  • Salesforce: Row filtering is only supported on two columns: the primary key (ID, if available) and the cursor column. The connector selects the cursor column from the following list, in order of preference: SystemModstamp, LastModifiedDate, CreatedDate, and LoginTime. For a filter that respects these constraints, see the example after this list.

  • ServiceNow:

    • Only the AND operator is supported. The OR operator is not currently available. For example, u_age = 40 AND u_active = TRUE works, but u_age = 40 OR u_active = TRUE does not.
    • Timestamps in the filters must be in the following format: YYYY-MM-DD HH:mm:SS (for example, 2004-03-02 17:14:59).
  • Row or query updates: If a row matched the filter on the initial load but the row or the query is later updated so that it no longer matches, the connector does not delete the row. Likewise, if the query is updated so that a previously non-matching row now matches, the connector does not ingest that row in subsequent updates; a full refresh is required to pick it up.
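
For example, a Salesforce filter that respects these constraints references only the cursor column and the primary key (the values shown are illustrative):

JSON
"row_filter": "SystemModstamp > '2025-06-10T23:40:11.000-07:00' AND Id != 'a00Qy00000vps2NIAQ'"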

Configure row filtering

To configure a pipeline with row filtering, add the row_filter config to your pipeline specification. For example:

Python
pipeline_spec = """
{
"name": "...",
"ingestion_definition": {
"connection_name": "...",
"objects": [
{
"table": {
"source_schema": "...",
"source_table": "...",
"destination_catalog": "...",
"destination_schema": "...",
"destination_table": "...",
"table_configuration": {
"row_filter": "details go here; see examples below"
}
}
}
]
},
"channel": "PREVIEW"
}
"""
create_pipeline(pipeline_spec)
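
If the filter itself contains quotes, it can be simpler to build the spec as a Python dictionary and serialize it with json.dumps rather than hand-writing the JSON string. This is a minimal sketch that assumes the same create_pipeline helper used above:

Python
import json

# Building the spec as a dict avoids manual escaping when the
# row filter contains quotes.
spec = {
    "name": "...",
    "ingestion_definition": {
        "connection_name": "...",
        "objects": [
            {
                "table": {
                    "source_schema": "...",
                    "source_table": "...",
                    "destination_catalog": "...",
                    "destination_schema": "...",
                    "destination_table": "...",
                    "table_configuration": {
                        # Single-quoted literals inside the filter pass
                        # through json.dumps unchanged.
                        "row_filter": "SystemModstamp > '2025-06-10T23:40:11.000-07:00'"
                    },
                }
            }
        ],
    },
    "channel": "PREVIEW",
}

create_pipeline(json.dumps(spec))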

Examples

Ingest data after a certain system timestamp:

JSON
"row_filter": "SystemModstamp > '2025-06-10T23:40:11.000-07:00'"

Ingest a specific row:

JSON
"row_filter": "Id = 'a00Qy00000vps2NIAQ'"

Supported operators

The following table shows which operators are supported for row filtering:

Operator   Supported
AND        Yes
OR         Salesforce and Google Analytics only
=          Yes
!=         Yes
LIKE       No
IN         No
<, <=      Yes
>, >=      Yes
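
Because IN is not supported, a membership test must be rewritten as equality checks joined with OR, which works on Salesforce and Google Analytics only (the IDs below are illustrative):

JSON
"row_filter": "Id = 'a00Qy00000vps2NIAQ' OR Id = 'a00Qy00000vps2OIAQ'"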

FAQ

Find answers to frequently asked questions about row filtering.

What happens if a row fails to match the row filter on the initial load but is later updated to match it on a subsequent load?

The row is ingested during the next pipeline update. This does not require a refresh.

What happens if a row matches the row filter on the initial load but is later updated to no longer match it?

The row is not deleted during the next pipeline update.

What happens if I update the query and a previously uningested row now matches?

The row is not ingested during the next pipeline update. To ingest it, run a full refresh.

What happens if I update the query and a previously ingested row no longer matches?

The row is not deleted during the next pipeline update.
