Skip to main content

UDFs for ABAC policies best practices

This page shows how to write high-performance user-defined functions (UDFs) for use in ABAC row filter and column mask policies in Unity Catalog.

Why UDF performance matters

ABAC policies run on every query execution for every applicable row or column. Inefficient UDFs can:

  • Block parallel execution and predicate pushdown
  • Force expensive joins or lookups per row
  • Increase query time from milliseconds to seconds (or more)
  • Break caching and vectorization
  • Make analytics and reporting unreliable

In production environments, this can make data governance a bottleneck instead of an enabler.

Golden rules for ABAC UDFs

  • Keep it simple: favor basic CASE statements and clear boolean expressions.
  • Stay deterministic: the same input should always produce the same output for caching and consistent joins.
  • Avoid external calls: no API calls or lookups to other databases.
  • Use only built-in functions inside your UDF: don’t call other UDFs from within a UDF.
  • Test at scale: validate performance on at least 1 million rows.
  • Minimize complexity: avoid multi-level nesting and unnecessary function calls.
  • Column-only references: reference only the target table's columns to enable pushdown.

Common anti-patterns to avoid

  • External API calls inside UDFs: causes network latency, timeouts, and single points of failure.
  • Complex subqueries or joins: forces nested loop joins and prevents optimization.
  • Heavy regex on large text: CPU- and memory-intensive per row.
  • Metadata lookups: avoid per-row checks that hit metadata or identity sources, such as information_schema queries, user profile lookups, is_member(), or is_account_group_member(). Adds extra scans for each record.
  • Dynamic SQL generation: no query optimization and prevents caching.
  • Non-deterministic logic: prevents caching and consistent joins, see Impact of non-deterministic logic.

Keep access checks in policies, not UDFs

A common mistake is calling is_account_group_member() or is_member() directly inside a UDF. This makes the function slower and makes the UDF more difficult to reuse.

Instead, follow this pattern:

  • UDF role: Focus only on how to transform, mask, or filter the data. Use only the columns and parameters passed into it.
  • Policy role: Define who (principals, groups) and when (tags) the UDF should apply by referencing principals in the ABAC policy.

Impact of non-deterministic logic

Some scenarios (such as randomized masking for research) require a different output each time.
If you must use non-deterministic functions:

  • Expect slower performance due to no caching.
  • JOINs might fail or return inconsistent results.
  • Reports might show different data between runs.
  • Troubleshooting and validation might be harder.

UDF examples

Below are production-friendly patterns for column masking and row filtering. All examples follow the ABAC performance best practices: simple logic, deterministic behavior, no external calls, and use of only built-in functions.

Column mask: fast and deterministic

SQL
-- Deterministically pseudonymize patient_id with a version tag for rotation.
CREATE OR REPLACE FUNCTION mask_patient_id_fast(patient_id STRING)
RETURNS STRING
DETERMINISTIC
RETURN CONCAT('REF_', SHA2(CONCAT(patient_id, ':v1'), 256));

Why this works:

  • Simple CASE statement
  • No external dependencies
  • Deterministic results for consistent joins
  • Keep principal logic out of the UDF

Column mask: partial reveal without regex hotspots

SQL
-- Reveal only the last 4 digits of an SSN, masking the rest.
CREATE OR REPLACE FUNCTION mask_ssn_last4(ssn STRING)
RETURNS STRING
DETERMINISTIC
RETURN CASE
WHEN ssn IS NULL THEN NULL
WHEN LENGTH(ssn) >= 4 THEN CONCAT('XXX-XX-', RIGHT(REGEXP_REPLACE(ssn, '[^0-9]', ''), 4))
ELSE 'MASKED'
END;

Why this works:

  • Uses a single lightweight regex to strip non-digits
  • Avoids multiple regex passes on large text fields

Column mask: deterministic pseudonymization with versioning

SQL
-- Create a consistent pseudonymized reference ID.
CREATE OR REPLACE FUNCTION mask_id_deterministic(id STRING)
RETURNS STRING
DETERMINISTIC
RETURN CONCAT('REF_', SHA2(CONCAT(id, ':v1'), 256));

Why this works:

  • Preserves joins across masked datasets
  • Includes a version tag (:v1) to support key rotation without breaking historical data intentionally

Row filter: pushdown-friendly by region

SQL
-- Returns TRUE if the row's state is in the allowed set.
CREATE OR REPLACE FUNCTION filter_by_region(state STRING, allowed ARRAY<STRING>)
RETURNS BOOLEAN
DETERMINISTIC
RETURN array_contains(TRANSFORM(allowed, x -> lower(x)), lower(state));

Why this works:

  • Simple boolean logic
  • References only table columns
  • Enables predicate pushdown and vectorization

Row filter: deterministic multi-condition

SQL
-- Returns TRUE if the row's region is in the allowed set.
CREATE OR REPLACE FUNCTION filter_region_in(region STRING, allowed_regions ARRAY<STRING>)
RETURNS BOOLEAN
DETERMINISTIC
RETURN array_contains(TRANSFORM(allowed_regions, x -> lower(x)), lower(region));

Why this works:

  • Supports multiple roles and geographies in one function
  • Keeps logic flat for better optimization

Testing UDF performance

Use synthetic scale tests to validate behavior and performance before production. For example:

SQL
WITH test_data AS (
SELECT
patient_id,
your_mask_function(patient_id) as masked_id,
current_timestamp() as start_time
FROM (
SELECT CONCAT('PAT', LPAD(seq, 6, '0')) as patient_id
FROM range(1000000) -- 1 million test rows
)
)
SELECT
COUNT(*) as rows_processed,
MAX(start_time) - MIN(start_time) as total_duration,
COUNT(*) / EXTRACT(EPOCH FROM (MAX(start_time) - MIN(start_time))) as rows_per_second
FROM test_data;