
Redact PII from OpenTelemetry traces in Unity Catalog

OpenTelemetry (OTel) trace data often contains personally identifiable information (PII) such as email addresses, phone numbers, and credit card numbers embedded in span attributes, log bodies, and resource metadata. Sharing this trace data broadly for debugging or observability can create compliance and privacy risks.

This page describes an example solution that uses AI Functions and Lakeflow Spark Declarative Pipelines to incrementally redact PII from raw OTel tables and write the results to a separate set of tables with broader access controls. A configurable retention job handles cleanup of the raw data. Deploy the downloadable assets into your own workspace and adapt them to your requirements.

You can use this solution with any OTel traces stored in Unity Catalog, including traces ingested by following Store OpenTelemetry traces in Unity Catalog.

How it works

OTel PII redaction overview

A Lakeflow Spark Declarative Pipelines pipeline incrementally reads new OTel spans, applies ai_mask to redact PII (emails, phones, SSNs, credit cards, names, and addresses), and writes to the redacted tables. A scheduled job handles optional retention cleanup on the raw tables.

Prerequisites

Download the assets

Download the following files and import them into your workspace:

| File | Description |
| --- | --- |
| deploy_notebook.py | Guided deployment notebook; an interactive alternative to deploy.sh. |
| deploy.sh | CLI deployment script. |
| pii_redaction_pipeline.sql | The pipeline: streaming tables with ai_mask. |
| unified_view.sql | Unified trace view joining spans and annotations. |
| setup_schema_and_grants.sql | Schema creation and access control grants. |
| pipeline_config.json | Example pipeline configuration (reference). |
| send_pii_traces.py | Test utility that sends PII test data as OTel spans. |
| pii_test_data.jsonl | 50 lines of synthetic PII test data. |

For more details, see the reference documentation.

Deploy the solution

Deploy with either the guided notebook (deploy_notebook.py) or the CLI script (deploy.sh). The steps below cover the guided notebook.

For a step-by-step deployment directly in your workspace:

  1. Import deploy_notebook.py into your workspace, along with the other downloaded assets. See Databricks Git folders.
  2. Open deploy_notebook.py in your workspace.
  3. Fill in the widget parameters at the top (catalog, source schema, target schema, and table prefix).
  4. Click Run all. Each step validates before proceeding.

This approach uses the Databricks Python SDK (no CLI required), is safe to re-run, and provides interactive feedback at each step.

Parameters

The following table describes the widget parameters in the guided deployment notebook (deploy_notebook.py):

| Parameter | Description | Default |
| --- | --- | --- |
| catalog | Unity Catalog catalog for both the raw and redacted tables. | (required) |
| source_schema | Schema containing the raw OTel tables. | (required) |
| target_schema | Schema for the redacted output tables. | (required) |
| table_prefix | Prefix used for the OTel table names. | (required) |
| pii_categories | PII types to redact, comma-separated and single-quoted. | 'email','phone','ssn','credit_card','name','address' |
| pipeline_name | Name for the pipeline. | otel-pii-redaction |
| retention_days | Days to retain raw data before deletion. A blank value, 0, or none disables deletion. | 90 |
| redaction_pipeline_mode | Pipeline execution mode: triggered or continuous. | triggered |
| redaction_trigger_frequency | How often the pipeline runs (triggered mode only): hourly, every 6 hours, daily, or weekly. | daily |

Source tables are named {catalog}.{source_schema}.{table_prefix}_otel_spans, {catalog}.{source_schema}.{table_prefix}_otel_logs, and {catalog}.{source_schema}.{table_prefix}_otel_annotations.
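The naming rule above can be sketched as a small helper. This is an illustration only, not code from the deployment assets; the function name source_table_names and the example values main, otel_raw, and prod are hypothetical.

```python
from typing import Dict


def source_table_names(catalog: str, source_schema: str, table_prefix: str) -> Dict[str, str]:
    """Return the fully qualified raw table name for each OTel signal."""
    base = f"{catalog}.{source_schema}.{table_prefix}_otel_"
    # One raw table per signal: spans, logs, and annotations.
    return {kind: base + kind for kind in ("spans", "logs", "annotations")}


# Hypothetical parameter values, for illustration only.
names = source_table_names("main", "otel_raw", "prod")
for kind, table in names.items():
    print(f"{kind}: {table}")
```

Checking the names this way before you run the deployment can catch a mismatched prefix early.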

The pipeline supports two execution modes:

  • triggered: Creates a scheduled job that triggers the pipeline on the chosen frequency. The pipeline processes new data on each run and then stops.
  • continuous: Runs the pipeline continuously, processing new data as it arrives. No scheduling job is created. This mode has higher compute costs than triggered mode because the pipeline is always running.
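As a rough sketch of how the two modes relate to scheduling, the following helper maps redaction_pipeline_mode and redaction_trigger_frequency to a Quartz cron expression, or to no schedule at all in continuous mode. The cron strings, including the times of day and the choice of Monday for weekly runs, are assumptions made for illustration. The deployment assets may schedule the job differently.

```python
from typing import Optional

# Assumed Quartz cron expressions (second minute hour day-of-month month day-of-week).
# The specific times and weekday are illustrative, not taken from the deployment assets.
CRON_BY_FREQUENCY = {
    "hourly": "0 0 * * * ?",
    "every 6 hours": "0 0 0/6 * * ?",
    "daily": "0 0 0 * * ?",
    "weekly": "0 0 0 ? * MON",
}


def schedule_for(mode: str, frequency: str = "daily") -> Optional[str]:
    """Return a cron expression for triggered mode, or None for continuous mode."""
    mode = mode.strip().lower()
    if mode == "continuous":
        # Continuous pipelines run all the time, so no scheduling job is needed.
        return None
    if mode != "triggered":
        raise ValueError(f"unknown pipeline mode: {mode!r}")
    key = frequency.strip().lower()
    if key not in CRON_BY_FREQUENCY:
        raise ValueError(f"unknown trigger frequency: {frequency!r}")
    return CRON_BY_FREQUENCY[key]


print(schedule_for("triggered", "every 6 hours"))
print(schedule_for("continuous"))
```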

What gets redacted

The pipeline applies ai_mask to the following fields:

| Table | Fields redacted |
| --- | --- |
| Spans | attributes, events, resource.attributes |
| Logs | body, attributes, resource.attributes |
| Annotations | Passthrough (no PII expected) |

The pipeline preserves non-PII fields unchanged, such as trace IDs, span IDs, timestamps, service names, and status codes.
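To make the field-level masking concrete, the sketch below turns a pii_categories value into an ai_mask SQL expression for a single top-level column. It assumes the column is cast to STRING before masking and that the categories are passed as an array of labels. The helper names are hypothetical, and the actual expressions in pii_redaction_pipeline.sql may differ, particularly for nested fields such as resource.attributes.

```python
from typing import List


def parse_categories(pii_categories: str) -> List[str]:
    """Split a value such as "'email','phone'" into ['email', 'phone']."""
    parts = (part.strip().strip("'") for part in pii_categories.split(","))
    return [part for part in parts if part]


def mask_expression(column: str, categories: List[str]) -> str:
    """Build an ai_mask call for one top-level column (assumed cast to STRING)."""
    labels = ", ".join(f"'{category}'" for category in categories)
    return f"ai_mask(CAST({column} AS STRING), array({labels})) AS {column}"


categories = parse_categories("'email','phone','ssn'")
print(mask_expression("attributes", categories))
```

Columns outside the redacted set would be selected unchanged alongside expressions like this one.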

Supported PII categories

ai_mask is LLM-backed and recognizes standard PII types, including email, phone, name, address, ssn, credit_card, ip_address, and date_of_birth.

ai_mask is recommended because it handles varied PII formats (for example, phone numbers written as (555) 123-4567, 555.123.4567, or +1 555-123-4567) without requiring a separate pattern for each variation. You can adapt the pipeline to use a different redaction method, such as explicit regular expressions with regexp_replace.

For custom patterns, such as employee IDs like EMP-XXXXXX, use regexp_replace before ai_mask in the pipeline SQL. For details, see PII redaction from OTel traces reference.
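Before porting a custom pattern to regexp_replace, it can help to prototype it locally. The following sketch assumes an employee ID is EMP- followed by exactly six digits, which is one possible reading of EMP-XXXXXX. The placeholder [EMPLOYEE_ID] and the function name are also illustrative.

```python
import re

# Assumption: "EMP-XXXXXX" means "EMP-" followed by exactly six digits.
EMPLOYEE_ID = re.compile(r"\bEMP-\d{6}\b")


def mask_employee_ids(text: str, replacement: str = "[EMPLOYEE_ID]") -> str:
    """Replace every employee ID in text with a fixed placeholder."""
    return EMPLOYEE_ID.sub(replacement, text)


sample = "Ticket opened by EMP-123456 and escalated to EMP-654321."
print(mask_employee_ids(sample))
```

Running the deterministic pattern first means the well-known custom identifiers are already masked before the LLM-backed step sees the text.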

Retention and access control

Raw data retention

The deployment configures auto-TTL (automatic time-to-live) on the raw OTel tables so that trace data older than a configurable number of days (default: 90) is deleted automatically. This helps you comply with GDPR and other data protection regulations that require personal data to be deleted once it is no longer needed for its original purpose. After the pipeline processes the raw spans, auto-TTL removes the originals that contain PII according to your retention policy. If you manage retention separately, leave retention_days blank or set it to 0 or none to disable automatic deletion. Exact auto-TTL deletion timing is not guaranteed, so if your compliance requirements demand strict deletion timelines, set up a manual scheduled job with DELETE and VACUUM instead.
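For the manual alternative, a scheduled job could build its statements along the following lines. This is a minimal sketch, not the retention job shipped with the assets: the timestamp column name event_time and the table name are hypothetical and must be replaced with the real ones, and the parsing of retention_days follows the rule that a blank value, 0, or none disables deletion.

```python
from typing import List, Optional


def parse_retention_days(value: Optional[str]) -> Optional[int]:
    """Return the retention period in days, or None when deletion is disabled."""
    text = (value or "").strip().lower()
    if text in ("", "0", "none"):
        return None
    days = int(text)
    if days < 0:
        raise ValueError("retention_days must not be negative")
    return days


def retention_statements(table: str, timestamp_column: str, retention_days: int) -> List[str]:
    """Build the DELETE and VACUUM statements for one raw table."""
    return [
        f"DELETE FROM {table} "
        f"WHERE {timestamp_column} < current_timestamp() - INTERVAL {retention_days} DAYS",
        f"VACUUM {table}",
    ]


days = parse_retention_days("90")
if days is not None:
    # Hypothetical table and column names, for illustration only.
    for statement in retention_statements("main.otel_raw.prod_otel_spans", "event_time", days):
        print(statement)
```

DELETE removes the rows logically, and VACUUM later removes the underlying data files once they fall outside the table's file retention window, so both steps matter for a strict deletion timeline.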

Limit access to raw tables

The raw OTel tables contain unredacted PII and should have restricted access. Grant access to the raw source schema only to pipeline service principals and administrators who need it for debugging or incident response. All routine analytics, dashboards, and observability workflows should query the redacted tables instead. The setup_schema_and_grants.sql file includes example grants to help enforce this separation. For more information about Unity Catalog privileges, see Manage privileges in Unity Catalog.
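The separation described above might translate into grants like the ones generated below. This sketch is not the content of setup_schema_and_grants.sql: the principal names and schema names are hypothetical, and it covers schema-level privileges only. Principals also need USE CATALOG on the parent catalog.

```python
from typing import Iterable, List


def redaction_grants(
    catalog: str,
    source_schema: str,
    target_schema: str,
    raw_principals: Iterable[str],
    redacted_principals: Iterable[str],
) -> List[str]:
    """Build read grants: a restricted set for raw data, a broader set for redacted data."""
    statements = []
    pairs = ((source_schema, raw_principals), (target_schema, redacted_principals))
    for schema, principals in pairs:
        for principal in principals:
            statements.append(f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`")
            statements.append(f"GRANT SELECT ON SCHEMA {catalog}.{schema} TO `{principal}`")
    return statements


# Hypothetical principals: a small admin group for raw data, analysts for redacted data.
for statement in redaction_grants("main", "otel_raw", "otel_redacted", ["otel-admins"], ["analysts"]):
    print(statement)
```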

Test the redaction

Send test PII data

Generate test spans with known PII to validate redaction:

Bash
pip install opentelemetry-exporter-otlp-proto-http

python send_pii_traces.py <WORKSPACE_HOST> <CATALOG.SCHEMA.PREFIX_otel_spans>

This sends 50 test traces that contain emails, phones, SSNs, credit cards, names, and addresses.

Validate the output

After running the pipeline, compare the raw and redacted spans:

SQL
SELECT
  s.span_id,
  CAST(s.attributes AS STRING) AS raw,
  CAST(r.attributes AS STRING) AS redacted
FROM <source_catalog>.<source_schema>.<prefix>_otel_spans s
JOIN <target_catalog>.<target_schema>.redacted_spans r
  ON s.trace_id = r.trace_id AND s.span_id = r.span_id
WHERE s.name = 'pii-test-interaction'
LIMIT 5;
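Beyond a visual comparison, you can check the redacted column programmatically for known test values. The helper below is a minimal sketch that assumes you have already fetched the redacted strings (for example, from the query above) and that you know which PII values the test data contained. The sample values and the stand-in redacted string are made up for illustration.

```python
from typing import List


def find_leaks(redacted_text: str, known_pii: List[str]) -> List[str]:
    """Return every known PII value that still appears in the redacted text."""
    lowered = redacted_text.lower()
    return [value for value in known_pii if value.lower() in lowered]


# Made-up values standing in for entries from the synthetic test data.
known_values = ["jane.doe@example.com", "555-123-4567"]
# Stand-in for one redacted attributes string returned by the comparison query.
redacted = '{"user.email": "[MASKED]", "user.phone": "555-123-4567"}'

leaks = find_leaks(redacted, known_values)
print(leaks if leaks else "no known PII found")
```

An empty result only shows that the known test values are gone. It does not prove that all PII in real traffic is caught.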

Next steps