Skip to main content

PII redaction from OTel traces reference

This page describes a reference architecture for redacting PII from OpenTelemetry (OTel) spans stored in Unity Catalog. It covers two complementary flows: server-side batch processing and view-based on-read redaction. Both flows use AI Functions and Lakeflow Spark Declarative Pipelines. For deployment instructions and downloadable assets, see Redact PII from OpenTelemetry traces in Unity Catalog.

Parameters

All components in this solution are parameterized for reuse across environments.

Table parameters

Parameter

Description

Example

source_catalog

Unity Catalog catalog containing the raw OTel tables.

ml_observability

source_schema

Unity Catalog schema containing the raw OTel tables.

traces_raw

table_prefix

Prefix used when configuring OTel trace storage.

mlflow

target_catalog

Unity Catalog catalog for the redacted output tables.

ml_observability

target_schema

Unity Catalog schema for the redacted output tables.

traces_redacted

retention_days

TTL for unredacted data (GDPR compliance). Set to 0 to disable automatic deletion.

90

Derived source table names:

  • {source_catalog}.{source_schema}.{table_prefix}_otel_spans
  • {source_catalog}.{source_schema}.{table_prefix}_otel_logs
  • {source_catalog}.{source_schema}.{table_prefix}_otel_annotations

PII redaction rules

Parameter

Description

Example

pii_categories

List of PII types to redact.

["email", "phone", "ssn", "credit_card", "ip_address", "name", "address"]

redaction_mode

How to mask PII: mask, hash, or remove.

mask

mask_character

Character used for masking.

*

fields_to_redact

OTel fields to apply redaction to.

["attributes", "resource.attributes", "events"]

allowlisted_keys

Attribute keys to skip redaction on (for example, service.name).

["service.name", "http.method", "http.status_code"]

custom_patterns

Regex patterns for domain-specific PII.

{"employee_id": "EMP-\\d{6}", "internal_account": "ACCT-[A-Z0-9]+"}

This flow uses Lakeflow Spark Declarative Pipelines to materialize redacted tables from the raw OTel tables.

Why use a declarative pipeline

Consideration

Lakeflow Spark Declarative Pipelines (streaming table)

Notebook and scheduled job

Incremental processing

Built-in — streaming tables only process new rows.

Manual checkpoint management with Structured Streaming.

AI function support

Built-in in SQL.

Built-in in SQL.

Monitoring and alerting

Built-in pipeline UI and event log.

Must configure separately.

Retry and failure handling

Automatic.

Manual.

Lineage

Automatic Unity Catalog lineage tracking.

Manual.

Serverless compute

Yes.

Yes (with serverless jobs).

Ops overhead

Low — fully managed.

Medium — manage state, schedules, and alerting.

Lakeflow Spark Declarative Pipelines with streaming tables is the best fit. OTel spans are append-only, making them ideal for incremental streaming table ingestion. AI functions such as ai_mask are built-in in SQL, so a SQL pipeline is the simplest implementation.

Architecture

OTel PII redaction architecture

Implementation

Step 1: Lock down the raw tables

Grant access to the raw tables only to the pipeline service principal and administrator users.

SQL
-- Restrict raw table access
GRANT USE CATALOG ON CATALOG ${source_catalog} TO `pii_pipeline_sp`;
GRANT USE SCHEMA ON SCHEMA ${source_catalog}.${source_schema} TO `pii_pipeline_sp`;
GRANT SELECT ON TABLE ${source_catalog}.${source_schema}.${table_prefix}_otel_spans TO `pii_pipeline_sp`;
GRANT SELECT ON TABLE ${source_catalog}.${source_schema}.${table_prefix}_otel_logs TO `pii_pipeline_sp`;

-- Revoke broader access
REVOKE SELECT ON TABLE ${source_catalog}.${source_schema}.${table_prefix}_otel_spans FROM `data_team`;

Step 2: Create the pipeline SQL

Define the redacted streaming tables in pii_redaction_pipeline.sql.

SQL
-- =============================================================
-- Streaming Table: Redacted Spans
-- =============================================================
CREATE OR REFRESH STREAMING TABLE redacted_spans
COMMENT 'PII-redacted OTel spans'
TBLPROPERTIES (
'quality' = 'gold',
'pipelines.autoOptimize.zOrderCols' = 'trace_id,date'
)
AS
SELECT
trace_id,
span_id,
parent_span_id,
name,
kind,
start_time,
end_time,
status,
date,
record_id,
service_name,
time,
instrumentation_scope,

-- Redact span attributes (VARIANT field)
-- ai_mask replaces PII with masked values in free-text content
CASE
WHEN attributes IS NOT NULL THEN
ai_mask(
CAST(attributes AS STRING),
array(${pii_categories}) -- e.g. array('email','phone','ssn','name','address')
)
ELSE attributes
END AS attributes,

-- Redact resource attributes
CASE
WHEN resource:attributes IS NOT NULL THEN
named_struct(
'attributes',
ai_mask(
CAST(resource:attributes AS STRING),
array(${pii_categories})
),
'dropped_attributes_count',
resource:dropped_attributes_count
)
ELSE resource
END AS resource,

-- Redact events (may contain exception messages with PII)
CASE
WHEN events IS NOT NULL THEN
ai_mask(
CAST(events AS STRING),
array(${pii_categories})
)
ELSE events
END AS events,

-- Pass through links unchanged (typically just trace/span IDs)
links

FROM STREAM(${source_catalog}.${source_schema}.${table_prefix}_otel_spans);


-- =============================================================
-- Streaming Table: Redacted Logs
-- =============================================================
CREATE OR REFRESH STREAMING TABLE redacted_logs
COMMENT 'PII-redacted OTel logs'
AS
SELECT
trace_id,
span_id,
severity_number,
severity_text,
date,
record_id,
service_name,
time,
instrumentation_scope,

-- Redact log body
CASE
WHEN body IS NOT NULL THEN
ai_mask(
CAST(body AS STRING),
array(${pii_categories})
)
ELSE body
END AS body,

-- Redact log attributes
CASE
WHEN attributes IS NOT NULL THEN
ai_mask(
CAST(attributes AS STRING),
array(${pii_categories})
)
ELSE attributes
END AS attributes,

-- Redact resource attributes
CASE
WHEN resource:attributes IS NOT NULL THEN
named_struct(
'attributes',
ai_mask(
CAST(resource:attributes AS STRING),
array(${pii_categories})
),
'dropped_attributes_count',
resource:dropped_attributes_count
)
ELSE resource
END AS resource

FROM STREAM(${source_catalog}.${source_schema}.${table_prefix}_otel_logs);


-- =============================================================
-- Streaming Table: Annotations (passthrough — no PII expected)
-- =============================================================
CREATE OR REFRESH STREAMING TABLE redacted_annotations
COMMENT 'OTel annotations (passthrough, no PII redaction applied)'
AS
SELECT *
FROM STREAM(${source_catalog}.${source_schema}.${table_prefix}_otel_annotations);

Step 3: Create the pipeline resource

Use the following configuration as a template.

JSON
{
"name": "otel-pii-redaction",
"catalog": "${target_catalog}",
"schema": "${target_schema}",
"serverless": true,
"continuous": false,
"channel": "CURRENT",
"configuration": {
"source_catalog": "<value>",
"source_schema": "<value>",
"table_prefix": "<value>",
"pii_categories": "'email','phone','ssn','credit_card','name','address'"
},
"libraries": [{ "file": { "path": "/Workspace/path/to/pii_redaction_pipeline.sql" } }]
}

Run the pipeline in triggered mode (for example, every 15 minutes or hourly) depending on your latency requirements. Continuous mode is also an option, but it increases cost.

Step 4: Create the unified view on the redacted tables

SQL
-- Recreate the trace_unified view pointing at redacted tables
CREATE OR REPLACE VIEW ${target_catalog}.${target_schema}.${table_prefix}_trace_unified AS
SELECT
s.trace_id,
s.date,
min(s.start_time) AS request_time,
max(s.end_time) - min(s.start_time) AS execution_duration,
collect_list(
named_struct(
'span_id', s.span_id,
'parent_span_id', s.parent_span_id,
'name', s.name,
'kind', s.kind,
'start_time', s.start_time,
'end_time', s.end_time,
'status', s.status,
'attributes', s.attributes,
'events', s.events
)
) AS spans,
a.tags,
a.assessments
FROM ${target_catalog}.${target_schema}.redacted_spans s
LEFT JOIN ${target_catalog}.${target_schema}.redacted_annotations a
ON s.trace_id = a.target_id
GROUP BY s.trace_id, s.date, a.tags, a.assessments;

Step 5: Grant broader access to the redacted tables

SQL
GRANT USE CATALOG ON CATALOG ${target_catalog} TO `data_team`;
GRANT USE SCHEMA ON SCHEMA ${target_catalog}.${target_schema} TO `data_team`;
GRANT SELECT ON SCHEMA ${target_catalog}.${target_schema} TO `data_team`;

Step 6: Set up retention on the raw tables (optional, for GDPR compliance)

If retention_days is configured (greater than 0), use auto time-to-live to automatically delete expired rows. The OTel trace tables are Unity Catalog managed Delta tables with time TIMESTAMP columns, so auto-TTL is supported. Predictive optimization must be enabled on the workspace (or table).

SQL
ALTER TABLE ${source_catalog}.${source_schema}.${table_prefix}_otel_spans
DELETE ROWS ${retention_days} DAYS AFTER time;

ALTER TABLE ${source_catalog}.${source_schema}.${table_prefix}_otel_logs
DELETE ROWS ${retention_days} DAYS AFTER time;

Databricks runs DELETE, PURGE, and VACUUM operations in the background automatically — no scheduled job is required.

note

Exact deletion timing is not guaranteed. There can be a buffer of up to 6 days between row expiration and permanent deletion, plus the data retention duration (default 7 days). If your compliance requirements demand strict deletion timelines, use a scheduled job with manual DELETE and VACUUM as a fallback. See auto time-to-live for details on calculating configuration values for a target expiration period.

Flow 2: View-based redaction (no data duplication)

This flow applies ai_mask in a Unity Catalog view, so redaction happens at read time and no redacted copy is stored.

When to use

  • Storage cost is a primary concern.
  • Redacted data is queried infrequently.
  • It is acceptable to pay the compute cost on every query.

Architecture

OTel PII redaction architecture

Implementation

SQL
CREATE OR REPLACE VIEW ${target_catalog}.${target_schema}.${table_prefix}_otel_spans_redacted
AS
SELECT
trace_id,
span_id,
parent_span_id,
name,
kind,
start_time,
end_time,
status,
date,
service_name,
time,
instrumentation_scope,
links,

ai_mask(
CAST(attributes AS STRING),
array(${pii_categories})
) AS attributes,

ai_mask(
CAST(events AS STRING),
array(${pii_categories})
) AS events,

named_struct(
'attributes',
ai_mask(CAST(resource:attributes AS STRING), array(${pii_categories})),
'dropped_attributes_count',
resource:dropped_attributes_count
) AS resource

FROM ${source_catalog}.${source_schema}.${table_prefix}_otel_spans;

Trade-offs

Aspect

Advantages

Disadvantages

Storage

No duplication.

Compute

ai_mask is called on every query, which is expensive at scale.

Latency

Immediately reflects new data.

Slower query response.

Flexibility

Redaction rules update instantly.

Flow comparison

Dimension

Flow 1: batch pipeline

Flow 2: view-based

Data fidelity preserved

Yes (raw table retained).

Yes (raw table retained).

Storage cost

2x (time-windowed, or ~1x if auto-TTL applies).

1x.

Compute cost

One-time per record.

Per query.

Query performance

Fast (pre-materialized).

Slow (recomputes).

Latency to availability

Minutes (pipeline interval).

Immediate.

Rule change rollout

Pipeline refresh.

Instant.

GDPR compliance

Auto-TTL or scheduled cleanup on raw tables.

Auto-TTL or scheduled cleanup on raw tables.

Best for

Primary production use.

Low-query-volume use.

Databricks feature

Lakeflow Spark Declarative Pipelines.

Unity Catalog view and AI Functions.

Use Flow 1 (batch pipeline) as the primary solution for most enterprise deployments:

  • Preserves full-fidelity data for authorized debugging.
  • Optimizes query performance through materialization.
  • Supports GDPR compliance with auto-TTL retention on raw data.
  • Handles both PII redaction and trace filtering in one pipeline.
  • Is fully managed with built-in monitoring and alerting.

Use Flow 2 (view-based) as a lightweight option for low-query-volume scenarios, or as a quick interim solution while you set up Flow 1.

Prerequisites

  • AI Functions enabled — requires a SQL warehouse or serverless compute with AI Functions access.
  • Unity Catalog — OTel traces must be stored in Unity Catalog tables with MLflow trace-to-Unity Catalog binding configured. See Store OpenTelemetry traces in Unity Catalog.
  • Service principal — for pipeline execution, with appropriate grants on the source tables.
  • Foundation model endpointai_mask uses a foundation model. Verify that the endpoint is available and sized for throughput.

Implementation checklist

  • Validate ai_mask behavior on VARIANT columns with sample OTel span data.
  • Benchmark ai_mask throughput to size the pipeline schedule interval.
  • Define the allowlisted attribute keys that should skip redaction.
  • Set up access control groups (raw access versus redacted access).
  • Configure auto-TTL for raw table retention (or a scheduled DELETE and VACUUM job if strict deletion timing is required).
  • Build a monitoring dashboard for pipeline health and redaction coverage.