User-defined operator YAML reference

Preview

This feature is in Public Preview.

This page describes the YAML configuration for user-defined operators in Lakeflow Designer. All operator types (uc-udf, uc-udtf, and python-run-function) use the user-defined-operator-v0.1.0 schema, which defines configuration fields using the JSON Schema format.

For information about how to build user-defined operators, see User-defined operators in Lakeflow Designer.

Root properties

Every operator YAML file starts with a set of root properties that identify the operator and define its behavior. The following example shows the general structure:

YAML
schema: user-defined-operator-v0.1.0
type: python-run-function
name: My Operator
id: my_operator
version: '1.0.0'
description: >
  What this operator does.
  Can be multiple lines.
config:
  type: object
  properties:
    my_field:
      type: string
      title: My Field
      description: Help text
ports:
  input:
    - name: data
      title: Input Data
  output:
    - name: out
      title: Output
run_function:
  type: inline
  code: |
    def run(config, inputs, spark):
        return {"out": inputs["data"]}
environment:
  environment_version: '4'
  dependencies:
    - 'pandas>=2.0'

Property | Type | Required | Description
schema | string | Yes | Schema identifier. Must be user-defined-operator-v0.1.0.
type | string | Yes | Type of operator: uc-udf, uc-udtf, or python-run-function.
name | string | Yes | Display name for the operator. Keep it short to fit the Lakeflow Designer UI. Minimum length of 1 character.
id | string | Yes | Unique identifier for the operator type. Minimum length of 1 character. Consider using namespaces (such as finance. or ml.) to categorize operators.
description | string | Yes | Detailed description of what the operator does. Shown to users in the UI. Use YAML multi-line syntax (>) for longer descriptions.
config | object | Yes | JSON Schema object that defines configuration fields. See Config.
ports | object | No | Input and output port definitions. See Ports.
version | string | Yes | Version string (for example, "1.0.0"). Use this to track your own operator releases.
run_function | object | No | Inline Python code for python-run-function operators. See run_function.
environment | object | No | Python environment configuration, including dependencies. See environment.
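
The required and optional root properties listed above can be checked locally before an operator is registered. The following is a minimal sketch, assuming the YAML file has already been parsed into a Python dictionary. The helper name check_root_properties is hypothetical and not part of Lakeflow Designer, and it covers only the rules stated in the table above.

```python
# Hypothetical helper: checks only the root-property rules from the table above.
REQUIRED_ROOT = ("schema", "type", "name", "id", "description", "config", "version")
OPERATOR_TYPES = ("uc-udf", "uc-udtf", "python-run-function")
SCHEMA_ID = "user-defined-operator-v0.1.0"


def check_root_properties(operator):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    for key in REQUIRED_ROOT:
        if key not in operator:
            problems.append(f"missing required property: {key}")
    if "schema" in operator and operator["schema"] != SCHEMA_ID:
        problems.append(f"schema must be {SCHEMA_ID}")
    if "type" in operator and operator["type"] not in OPERATOR_TYPES:
        problems.append("type must be one of " + ", ".join(OPERATOR_TYPES))
    # name and id have a documented minimum length of 1 character.
    for key in ("name", "id"):
        value = operator.get(key)
        if key in operator and (not isinstance(value, str) or len(value) < 1):
            problems.append(f"{key} must be a non-empty string")
    return problems


# Dictionary form of a minimal operator definition.
example = {
    "schema": SCHEMA_ID,
    "type": "python-run-function",
    "name": "My Operator",
    "id": "my_operator",
    "version": "1.0.0",
    "description": "What this operator does.",
    "config": {"type": "object", "properties": {}},
}
print(check_root_properties(example))
```

Returning a list of problems rather than raising on the first one lets an author fix several issues in a single pass. The schema validation that Lakeflow Designer itself performs may be stricter than this sketch.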

Ports

Ports define how your operator connects to other operators in the pipeline. The ports object contains input and output arrays.

YAML
ports:
  input:
    - name: input_data
      title: Input Data
      mime: application/vnd.databricks.dataframe
      allowMultiple: true
      required: true
  output:
    - name: out
      title: Output

Property | Type | Required | Description
name | string | Yes | Unique identifier for the port. Used in connections and config references.
title | string | No | Human-readable label displayed in the UI.
mime | string | No | MIME type for the port data. For example, application/vnd.databricks.dataframe.
allowMultiple | boolean | No | If true, the port accepts multiple incoming connections.
required | boolean | No | If false, the port is optional. Default: true.

Only the documented port properties are accepted. Unknown keys (such as the legacy label field) are rejected by schema validation.
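
The rules above (a closed set of port keys, a mandatory name, and required defaulting to true) can be illustrated with a small local check. This is a sketch under the assumption that each port has been parsed into a Python dictionary. The helper normalize_port is hypothetical; the real validation is performed by Lakeflow Designer against the schema.

```python
# Port keys documented in the table above; anything else is rejected.
ALLOWED_PORT_KEYS = {"name", "title", "mime", "allowMultiple", "required"}


def normalize_port(port):
    """Reject undocumented keys and fill in the documented default for `required`."""
    unknown = set(port) - ALLOWED_PORT_KEYS
    if unknown:
        raise ValueError(f"unknown port keys: {sorted(unknown)}")
    if "name" not in port:
        raise ValueError("port is missing required key: name")
    # `required` defaults to True when omitted.
    return {**port, "required": port.get("required", True)}


print(normalize_port({"name": "reference_data", "required": False}))
try:
    normalize_port({"name": "in", "label": "Input"})  # legacy `label` key
except ValueError as err:
    print(err)
```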

Port examples

UDF with input and output ports:

YAML
ports:
  input:
    - name: in
      title: Input Data
  output:
    - name: out
      title: Output

UDTF with input and output ports:

YAML
ports:
  input:
    - name: input_data
      title: Input Data
  output:
    - name: clustered_data
      title: Clustered Results

python-run-function with multiple inputs and an optional port:

YAML
ports:
  input:
    - name: main_data
      title: Main Data
    - name: reference_data
      title: Reference Table
      required: false
  output:
    - name: joined_output
      title: Joined Output

Config

The config field is a JSON Schema object. You define each configuration field as a property within the schema. This format gives you access to standard JSON Schema validation features like enum, minimum, maximum, and examples.

The config object must have type: object and a properties map. You can optionally include required (an array of required property names) and additionalProperties.

YAML
config:
  type: object
  properties:
    cluster_count:
      type: number
      title: Number of Clusters
      description: How many clusters to create
      default: 3
      minimum: 1
      maximum: 100
    algorithm:
      type: string
      title: Algorithm
      description: Clustering algorithm to use
      enum: ['kmeans', 'dbscan', 'hierarchical']
      default: kmeans
    feature_col:
      type: string
      title: Feature Column
      description: Column to use as input
      format: expression
      x-ui:
        widget: expression
        port: data
  required: [cluster_count, feature_col]
  additionalProperties: false
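
To make the effect of required, enum, minimum, maximum, and additionalProperties concrete, the sketch below applies those keywords to a dictionary of user-supplied values. It is a hand-rolled illustration of standard JSON Schema semantics for these five keywords only, not the validator Lakeflow Designer uses. The helper check_config is hypothetical, and the schema dictionary mirrors the example above with the UI-only fields omitted.

```python
# The config example above as a Python dictionary (title, format, and x-ui omitted).
config_schema = {
    "type": "object",
    "properties": {
        "cluster_count": {"type": "number", "default": 3, "minimum": 1, "maximum": 100},
        "algorithm": {
            "type": "string",
            "enum": ["kmeans", "dbscan", "hierarchical"],
            "default": "kmeans",
        },
        "feature_col": {"type": "string"},
    },
    "required": ["cluster_count", "feature_col"],
    "additionalProperties": False,
}


def check_config(schema, values):
    """Return a list of problems found in `values`; empty means none found."""
    problems = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in values:
            problems.append(f"{name}: required")
    for name, value in values.items():
        if name not in props:
            if schema.get("additionalProperties") is False:
                problems.append(f"{name}: not allowed")
            continue
        spec = props[name]
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name}: must be one of {spec['enum']}")
        # minimum and maximum are inclusive bounds and apply to numbers only.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and "minimum" in spec and value < spec["minimum"]:
            problems.append(f"{name}: below minimum {spec['minimum']}")
        if is_number and "maximum" in spec and value > spec["maximum"]:
            problems.append(f"{name}: above maximum {spec['maximum']}")
    return problems


for problem in check_config(config_schema, {"cluster_count": 500, "algorithm": "svm"}):
    print(problem)
```

With these inputs the sketch reports three problems: feature_col is missing, cluster_count exceeds the maximum, and algorithm is not one of the enumerated values.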

Config property fields

Each property in the config.properties object supports the following standard JSON Schema fields:

Field | Type | Description
type | string | Data type: string, number, integer, boolean, array, or object.
title | string | Human-readable label displayed in the UI.
description | string | Help text shown to users.
default | any | Default value for the field.
examples | array | Example values for the field.
enum | array | Fixed list of allowed values.
format | string | Semantic type hint. See Format values.
minimum | number | Minimum allowed value (for number and integer types).
maximum | number | Maximum allowed value (for number and integer types).
items | object | Schema for array elements (when type is array).
properties | object | Nested property definitions (when type is object).
required | array | List of required nested property names (when type is object).

Other standard JSON Schema fields such as minLength, maxLength, pattern, and const are also supported.

Format values

The format field on a config property provides a semantic type hint that tells Lakeflow Designer how to interpret the value. These hints enable specialized UI behavior and validation.

Format | Description
expression | Column reference or SQL expression.
table_source | Table source reference.
file_source | File source reference.
column_expressions | Column expressions.
sort_expressions | Sort expressions.
aggregation_expressions | Aggregation expressions.
ai_function_expressions | AI function expressions.
is_preview | Automatic preview mode flag. Lakeflow Designer sets this to true during workflow preview. The config property name is arbitrary; only the format: is_preview tag matters. Use this to skip side effects like external API calls during preview.
string[] | String array.
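
The is_preview format is easiest to understand from the run() side. The sketch below assumes a boolean config property named dry_run that is declared with format: is_preview; the property name and the send_alert helper are hypothetical stand-ins. When Lakeflow Designer sets the flag to true during preview, the function skips its side effect and still returns the output data.

```python
sent_alerts = []  # stands in for an external system in this sketch


def send_alert(message):
    """Hypothetical side effect, for example a call to an external API."""
    sent_alerts.append(message)


def run(config, inputs, spark):
    df = inputs["data"]
    # `dry_run` is assumed to be declared with `format: is_preview`,
    # so it is true while the workflow is being previewed.
    if not config.get("dry_run", False):
        send_alert("operator ran")
    return {"out": df}


# Simulated preview run: no alert is sent.
run({"dry_run": True}, {"data": "placeholder"}, spark=None)
print(len(sent_alerts))
```

Reading the flag with config.get and a default of False is a defensive choice in this sketch, so that the side effect still runs if the flag is absent from the configuration.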

UI widgets

Widgets customize how a config field renders in the Lakeflow Designer interface. Define widgets in the x-ui property on each config property. If you omit the widget, Lakeflow Designer uses a default widget based on the data type.

Widget | Data type | Description
input | string | Single-line text input.
textarea | string | Multi-line text area. Supports optional rows property.
checkbox | boolean | Standard checkbox.
toggle | boolean | Toggle switch.
number | number/integer | Numeric input with optional constraints.
slider | number/integer | Visual slider for numeric ranges. Supports optional step property.
select | string | Single-select dropdown. Requires optionsSource.
multi-select | array | Multi-select dropdown. Requires optionsSource.
expression | string | Column/expression selector. Requires port.

input

Single-line text input field.

YAML
api_endpoint:
  type: string
  title: API Endpoint
  x-ui:
    widget: input

textarea

Multi-line text area for longer content. Supports an optional rows property to control the height.

YAML
message_body:
  type: string
  title: Message Body
  x-ui:
    widget: textarea
    rows: 4

checkbox

Standard checkbox for boolean values.

YAML
send_notification:
  type: boolean
  title: Send Notification
  default: false
  x-ui:
    widget: checkbox

toggle

Toggle switch for boolean values.

YAML
enable_logging:
  type: boolean
  title: Enable Logging
  default: true
  x-ui:
    widget: toggle

number

Numeric input field. Use minimum and maximum on the property itself to constrain the range.

YAML
num_clusters:
  type: number
  title: Number of Clusters
  default: 3
  minimum: 1
  maximum: 100
  x-ui:
    widget: number

slider

Visual slider for selecting numeric values within a range. Use minimum and maximum on the property to set the range, and step in x-ui to control the increment.

YAML
confidence_threshold:
  type: number
  title: Confidence Threshold
  default: 0.8
  minimum: 0
  maximum: 1
  x-ui:
    widget: slider
    step: 0.05

select

Single-select dropdown. Requires an optionsSource to define where the dropdown values come from. See Options sources.

YAML
aggregation_type:
  type: string
  title: Aggregation Type
  x-ui:
    widget: select
    optionsSource:
      type: static
      values: ['sum', 'avg', 'min', 'max', 'count']

multi-select

Multi-select dropdown for choosing multiple values. Use type: array with items: { type: string } on the property. Requires an optionsSource. See Options sources.

YAML
feature_columns:
  type: array
  title: Feature Columns
  items:
    type: string
  x-ui:
    widget: multi-select
    optionsSource:
      type: inputColumns
      port: input_data

expression

Column/expression selector that lets users pick a column from the input data or write a custom SQL expression. Set format: expression on the property and specify the input port in x-ui. Use this widget in the following cases:

  • When the user should select a column from the input data.
  • When the user might want to write a custom SQL expression.
  • For parameters that reference dynamic data in the pipeline.

YAML
amount:
  type: string
  title: Amount
  format: expression
  x-ui:
    widget: expression
    port: input_data

Options sources

For select and multi-select widgets, you must define where the dropdown options come from using optionsSource.

Static options

A fixed list of values defined in the YAML.

YAML
optionsSource:
  type: static
  values: ['option1', 'option2', 'option3']

Property | Type | Required | Description
type | string | Yes | Must be static.
values | array | Yes | Array of string values for the dropdown.

Input columns

Dynamically populates the dropdown with column names from an input port.

YAML
optionsSource:
  type: inputColumns
  port: input_data

Property | Type | Required | Description
type | string | Yes | Must be inputColumns.
port | string | Yes | Name of the input port to get column names from. Must match the name of one of your defined input ports.
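
The two options sources differ only in where the dropdown values come from. The sketch below models that difference. It assumes the optionsSource block has been parsed into a dictionary and that the column names of each input port are available as a mapping from port name to a list of names. The function resolve_options is hypothetical and only illustrates the lookup; it is not how Lakeflow Designer is implemented.

```python
def resolve_options(options_source, input_columns):
    """Return the dropdown values for a select or multi-select widget.

    options_source: parsed `optionsSource` block.
    input_columns: mapping of input port name to its column names (assumed available).
    """
    kind = options_source["type"]
    if kind == "static":
        # Fixed list written directly in the YAML.
        return list(options_source["values"])
    if kind == "inputColumns":
        port = options_source["port"]
        # The port must match the name of a defined input port.
        if port not in input_columns:
            raise KeyError(f"optionsSource refers to undefined input port: {port}")
        return list(input_columns[port])
    raise ValueError(f"unsupported optionsSource type: {kind}")


columns = {"input_data": ["customer_id", "age", "spend"]}
print(resolve_options({"type": "static", "values": ["sum", "avg"]}, columns))
print(resolve_options({"type": "inputColumns", "port": "input_data"}, columns))
```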

run_function

The run_function property lets you embed Python code directly in the YAML configuration for python-run-function operators. This eliminates the need to register a separate Unity Catalog function.

YAML
run_function:
  type: inline
  code: |
    def run(config, inputs, spark):
        df = inputs["data"]
        threshold = config["threshold"]
        return {"out": df.filter(df["score"] > threshold)}

Property | Type | Required | Description
type | string | Yes | Must be inline.
code | string | Yes | Python source code. Must define a run() function.

The run() function receives three arguments:

  • config: A dictionary of configuration values set by the user in the UI.
  • inputs: A dictionary mapping input port names to DataFrames.
  • spark: The active SparkSession.

The function must return a dictionary mapping output port names to DataFrames. The keys must exactly match the name field of each output port defined in ports.output. For example, with an output port named out:

Python
return {"out": result_df}

With multiple output ports:

Python
return {"match": match_df, "rest": rest_df}
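
Because the returned keys must match the output port names, a mismatch is an easy mistake to make when ports are renamed. The following sketch shows one way to catch it while testing a run() function locally. The helper check_outputs is hypothetical, and it interprets "exactly match" as set equality between the returned keys and the names in ports.output, which is an assumption.

```python
def check_outputs(ports, result):
    """Raise ValueError if the keys of `result` differ from the output port names."""
    expected = {port["name"] for port in ports.get("output", [])}
    returned = set(result)
    if expected != returned:
        missing = sorted(expected - returned)
        unexpected = sorted(returned - expected)
        raise ValueError(f"output mismatch: missing={missing}, unexpected={unexpected}")


# Dictionary form of a `ports` block with two output ports.
ports = {
    "input": [{"name": "data"}],
    "output": [{"name": "match"}, {"name": "rest"}],
}

check_outputs(ports, {"match": "match_df", "rest": "rest_df"})  # passes silently
try:
    check_outputs(ports, {"match": "match_df", "other": "rest_df"})
except ValueError as err:
    print(err)
```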

environment

The environment property specifies the Python environment for python-run-function operators. Use it to pin the environment version and declare pip dependencies.

YAML
environment:
  environment_version: '4'
  dependencies:
    - 'scikit-learn>=1.3'
    - 'pandas>=2.0'

Property | Type | Required | Description
environment_version | string | No | The environment version to use. For example, "4".
dependencies | array of strings | No | List of pip dependency specifiers. Each entry follows standard pip syntax (for example, "pandas>=2.0").

Complete examples

UC-based UDF

This example defines a Unity Catalog-based UDF operator that calculates compound interest.

YAML
schema: user-defined-operator-v0.1.0
type: uc-udf
name: Compound Interest
id: finance.compound_interest
version: '1.0.0'
description: >
  Calculates compound interest based on principal, rate, and time period.

config:
  type: object
  properties:
    principal:
      type: string
      title: Principal Amount
      format: expression
      x-ui:
        widget: expression
        port: input_data

    annual_rate:
      type: number
      title: Annual Interest Rate
      default: 5.0
      minimum: 0
      maximum: 100
      x-ui:
        widget: number

    years:
      type: number
      title: Number of Years
      default: 10
      minimum: 1
      maximum: 50
      x-ui:
        widget: slider
        step: 1

    compound_frequency:
      type: string
      title: Compounding Frequency
      default: 'monthly'
      x-ui:
        widget: select
        optionsSource:
          type: static
          values: ['daily', 'monthly', 'quarterly', 'annually']
  required: [principal, annual_rate]
  additionalProperties: false

ports:
  input:
    - name: input_data
      title: Input Data
  output:
    - name: out
      title: Output

Python run-function operator

This example defines a python-run-function operator that segments customers using K-Means clustering.

YAML
schema: user-defined-operator-v0.1.0
type: python-run-function
name: Customer Segmentation
id: ml.customer_segmentation
version: '1.2.0'
description: >
  Segments customers into groups based on selected features
  using K-Means clustering. Returns customer IDs with their
  assigned segment numbers.

config:
  type: object
  properties:
    num_segments:
      type: integer
      title: Number of Segments
      description: How many customer segments to create
      default: 3
      minimum: 2
      maximum: 20
      x-ui:
        widget: number
    customer_id_column:
      type: string
      title: Customer ID Column
      description: Column containing customer identifiers
      x-ui:
        widget: select
        optionsSource:
          type: inputColumns
          port: customer_data
    feature_columns:
      type: array
      title: Feature Columns
      description: Columns to use for segmentation
      items:
        type: string
      x-ui:
        widget: multi-select
        optionsSource:
          type: inputColumns
          port: customer_data
    normalize_features:
      type: boolean
      title: Normalize Features
      description: Whether to normalize feature values before clustering
      default: true
      x-ui:
        widget: toggle
  required: [num_segments, customer_id_column, feature_columns]
  additionalProperties: false

ports:
  input:
    - name: customer_data
      title: Customer Data
      mime: application/vnd.databricks.dataframe
  output:
    - name: segmented_customers
      title: Segmented Customers

run_function:
  type: inline
  code: |
    def run(config, inputs, spark):
        from pyspark.ml.feature import VectorAssembler, StandardScaler
        from pyspark.ml.clustering import KMeans

        df = inputs["customer_data"]
        id_col = config["customer_id_column"]
        features = config["feature_columns"]
        k = config["num_segments"]
        normalize = config.get("normalize_features", True)

        assembler = VectorAssembler(inputCols=features, outputCol="features_vec")
        assembled = assembler.transform(df)

        if normalize:
            scaler = StandardScaler(inputCol="features_vec", outputCol="scaled_features")
            model = scaler.fit(assembled)
            assembled = model.transform(assembled)
            feature_col = "scaled_features"
        else:
            feature_col = "features_vec"

        kmeans = KMeans(k=k, featuresCol=feature_col, predictionCol="segment")
        result = kmeans.fit(assembled).transform(assembled)

        return {"segmented_customers": result.select(id_col, "segment")}

environment:
  environment_version: '4'
  dependencies:
    - 'scikit-learn>=1.3'

Quick reference

Required root properties

  • schema: user-defined-operator-v0.1.0
  • name: Display name
  • id: Unique identifier
  • description: What the operator does
  • config: JSON Schema object
  • type: uc-udf, uc-udtf, or python-run-function
  • version: Author-defined version string

Optional root properties

  • ports: Input and output port definitions
  • run_function: Inline Python code (python-run-function only)
  • environment: Python environment and dependencies (python-run-function only)

Config property data types

string | boolean | number | integer | array | object

UI widgets

input | textarea | checkbox | toggle | number | slider | select | multi-select | expression

Options sources

static (fixed values) | inputColumns (from input port)

Format values

expression | table_source | file_source | column_expressions | sort_expressions | aggregation_expressions | ai_function_expressions | is_preview | string[]