UDFs for ABAC policies best practices
This feature is in Public Preview.
This page shows how to write high-performance user-defined functions (UDFs) for use in ABAC row filter and column mask policies in Unity Catalog.
Why UDF performance matters
ABAC policies run on every query execution for every applicable row or column. Inefficient UDFs can:
- Block parallel execution and predicate pushdown
- Force expensive joins or lookups per row
- Increase query time from milliseconds to seconds (or more)
- Break caching and vectorization
- Make analytics and reporting unreliable
In production environments, this can make data governance a bottleneck instead of an enabler.
Golden rules for ABAC UDFs
- Keep it simple: favor basic
CASEstatements and clear boolean expressions. - Stay deterministic: the same input should always produce the same output for caching and consistent joins.
- Avoid external calls: no API calls or lookups to other databases.
- Use only built-in functions inside your UDF: don’t call other UDFs from within a UDF.
- Test at scale: validate performance on at least 1 million rows.
- Minimize complexity: avoid multi-level nesting and unnecessary function calls.
- Column-only references: reference only the target table's columns to enable pushdown.
- Prefer SQL to Python UDFs: for better performance. If you must use Python, explicitly mark UDFs as
DETERMINISTICwhen they don't involve non-deterministic logic and dependencies.
Common anti-patterns to avoid
- External API calls inside UDFs: causes network latency, timeouts, and single points of failure.
- Complex subqueries or joins: forces nested loop joins and prevents optimization.
- Heavy regex on large text: CPU- and memory-intensive per row.
- Metadata lookups: avoid per-row checks that hit metadata or identity sources, such as
information_schemaqueries, user profile lookups,is_member(), oris_account_group_member(). Adds extra scans for each record. - Dynamic SQL generation: no query optimization and prevents caching.
- Non-deterministic logic: prevents caching and consistent joins, see Impact of non-deterministic logic.
Keep access checks in policies, not UDFs
A common mistake is calling is_account_group_member() or is_member() directly inside a UDF. This makes the function slower and makes the UDF more difficult to reuse.
Instead, follow this pattern:
- UDF role: Focus only on how to transform, mask, or filter the data. Use only the columns and parameters passed into it.
- Policy role: Define who (principals, groups) and when (tags) the UDF should apply by referencing principals in the ABAC policy.
Impact of non-deterministic logic
Some scenarios (such as randomized masking for research) require a different output each time.
If you must use non-deterministic functions:
- Expect slower performance due to no caching.
- JOINs might fail or return inconsistent results.
- Reports might show different data between runs.
- Troubleshooting and validation might be harder.
UDF examples
Below are production-friendly patterns for column masking and row filtering. All examples follow the ABAC performance best practices: simple logic, deterministic behavior, no external calls, and use of only built-in functions.
Column mask: fast and deterministic
-- Deterministically pseudonymize patient_id with a version tag for rotation.
CREATE OR REPLACE FUNCTION mask_patient_id_fast(patient_id STRING)
RETURNS STRING
DETERMINISTIC
RETURN CONCAT('REF_', SHA2(CONCAT(patient_id, ':v1'), 256));
Why this works:
- Simple CASE statement
- No external dependencies
- Deterministic results for consistent joins
- Keep principal logic out of the UDF
Column mask: partial reveal without regex hotspots
-- Reveal only the last 4 digits of an SSN, masking the rest.
CREATE OR REPLACE FUNCTION mask_ssn_last4(ssn STRING)
RETURNS STRING
DETERMINISTIC
RETURN CASE
WHEN ssn IS NULL THEN NULL
WHEN LENGTH(ssn) >= 4 THEN CONCAT('XXX-XX-', RIGHT(REGEXP_REPLACE(ssn, '[^0-9]', ''), 4))
ELSE 'MASKED'
END;
Why this works:
- Uses a single lightweight regex to strip non-digits
- Avoids multiple regex passes on large text fields
Column mask: deterministic pseudonymization with versioning
-- Create a consistent pseudonymized reference ID.
CREATE OR REPLACE FUNCTION mask_id_deterministic(id STRING)
RETURNS STRING
DETERMINISTIC
RETURN CONCAT('REF_', SHA2(CONCAT(id, ':v1'), 256));
Why this works:
- Preserves joins across masked datasets
- Includes a version tag (:v1) to support key rotation without breaking historical data intentionally
Row filter: pushdown-friendly by region
-- Returns TRUE if the row's state is in the allowed set.
CREATE OR REPLACE FUNCTION filter_by_region(state STRING, allowed ARRAY<STRING>)
RETURNS BOOLEAN
DETERMINISTIC
RETURN array_contains(TRANSFORM(allowed, x -> lower(x)), lower(state));
Why this works:
- Simple boolean logic
- References only table columns
- Enables predicate pushdown and vectorization
Row filter: deterministic multi-condition
-- Returns TRUE if the row's region is in the allowed set.
CREATE OR REPLACE FUNCTION filter_region_in(region STRING, allowed_regions ARRAY<STRING>)
RETURNS BOOLEAN
DETERMINISTIC
RETURN array_contains(TRANSFORM(allowed_regions, x -> lower(x)), lower(region));
Why this works:
- Supports multiple roles and geographies in one function
- Keeps logic flat for better optimization
Testing UDF performance
Use synthetic scale tests to validate behavior and performance before production. For example:
WITH test_data AS (
SELECT
patient_id,
your_mask_function(patient_id) as masked_id,
current_timestamp() as start_time
FROM (
SELECT CONCAT('PAT', LPAD(seq, 6, '0')) as patient_id
FROM range(1000000) -- 1 million test rows
)
)
SELECT
COUNT(*) as rows_processed,
MAX(start_time) - MIN(start_time) as total_duration,
COUNT(*) / EXTRACT(EPOCH FROM (MAX(start_time) - MIN(start_time))) as rows_per_second
FROM test_data;
ABAC column mask type casting
Databricks automatically casts the output of column mask functions resolved from ABAC policies to match the target column's data type. This ensures type consistency and reliable query behavior when masking columns.
How automatic casting works
- Policy evaluation: The ABAC policy determines if masking should be applied.
- Mask function execution: If masking is required, the mask function executes.
- Automatic casting: The mask function result is automatically cast to the target column's data type.
- Result return: The properly typed result is returned to the query.
All casting behavior follows ANSI SQL standards for CAST operations. For full compatibility details, see Returns.
Best practices for type-safe masking
Follow these patterns to ensure your mask functions work reliably with automatic type casting and avoid runtime failures.
Design type-consistent mask functions
Ensure mask functions return types compatible with target columns:
-- Successful: Returns same type as target column
CREATE FUNCTION safe_salary_mask(salary DOUBLE, user_role STRING)
RETURNS DOUBLE
RETURN CASE
WHEN user_role IN ('admin', 'hr') THEN salary
WHEN user_role = 'manager' THEN ROUND(salary / 1000) * 1000
ELSE 0.0
END;
-- Unsuccessful: Returns different type that might not cast
CREATE FUNCTION risky_salary_mask(salary DOUBLE, user_role STRING)
RETURNS STRING
RETURN CASE
WHEN user_role IN ('admin', 'hr') THEN CAST(salary AS STRING)
ELSE 'CONFIDENTIAL' -- This will fail casting to DOUBLE
END;
Use VARIANT for flexible masking
For complex masking scenarios with varying output types, use VARIANT to accommodate different data types:
CREATE FUNCTION flexible_mask(data VARIANT)
RETURNS VARIANT
RETURN CASE
WHEN schema_of_variant(data) = 'INT' THEN 0::VARIANT
WHEN schema_of_variant(data) = 'DATE' THEN DATE'1970-01-01'::VARIANT
WHEN schema_of_variant(data) = 'DOUBLE' THEN 0.00::VARIANT
ELSE NULL::VARIANT
END;
Test with diverse data scenarios
Test mask functions with different data patterns and user contexts before production deployment:
-- Test different access levels using different users/groups
SELECT * FROM sensitive_table;
-- Test casting behavior for different regional scenarios
SELECT CAST(mask_transaction(transaction_info, 'US') AS STRUCT<id INT, amount DOUBLE>);
SELECT CAST(mask_transaction(transaction_info, 'JP') AS STRUCT<id INT, amount DOUBLE>);
SELECT CAST(mask_transaction(transaction_info, 'CN') AS STRUCT<id INT, amount DOUBLE>);
Limitations
For limitations on row filers and column masks, see Limitations.