User-defined operator YAML reference
This feature is in Public Preview.
This page describes the YAML configuration for user-defined operators in Lakeflow Designer. All operator types (uc-udf, uc-udtf, and python-run-function) use the user-defined-operator-v0.1.0 schema, which defines configuration fields using the JSON Schema format.
For information about how to build user-defined operators, see User-defined operators in Lakeflow Designer.
Root properties
Every operator YAML file starts with a set of root properties that identify the operator and define its behavior. The following example shows the general structure:
schema: user-defined-operator-v0.1.0
type: python-run-function
name: My Operator
id: my_operator
version: '1.0.0'
description: >
  What this operator does.
  Can be multiple lines.
config:
  type: object
  properties:
    my_field:
      type: string
      title: My Field
      description: Help text
ports:
  input:
    - name: data
      title: Input Data
  output:
    - name: out
      title: Output
run_function:
  type: inline
  code: |
    def run(config, inputs, spark):
        return {"out": inputs["data"]}
environment:
  environment_version: '4'
  dependencies:
    - 'pandas>=2.0'
| Property | Type | Required | Description |
|---|---|---|---|
| schema | string | Yes | Schema identifier. Must be user-defined-operator-v0.1.0. |
| type | string | Yes | Type of operator: uc-udf, uc-udtf, or python-run-function. |
| name | string | Yes | Display name for the operator. Keep it short to fit the Lakeflow Designer UI. Minimum length of 1 character. |
| id | string | Yes | Unique identifier for the operator type. Minimum length of 1 character. Consider using namespaces (such as finance.compound_interest). |
| description | string | Yes | Detailed description of what the operator does. Shown to users in the UI. Use YAML multi-line syntax (for example, >) for longer text. |
| config | object | Yes | JSON Schema object that defines configuration fields. See Config. |
| ports | object | No | Input and output port definitions. See Ports. |
| version | string | Yes | Version string (for example, '1.0.0'). |
| run_function | object | No | Inline Python code for python-run-function operators. See run_function. |
| environment | object | No | Python environment configuration, including dependencies. See environment. |
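The required root properties can be checked before an operator file is uploaded. The following is a minimal sketch, not part of Lakeflow Designer: it assumes the YAML has already been parsed into a Python dict, and the helper name find_root_problems is hypothetical.

```python
# Hypothetical helper: checks an already-parsed operator definition (a dict)
# against the required root properties. This is not an official validator.
REQUIRED_ROOT = ("schema", "type", "name", "id", "version", "description", "config")
OPERATOR_TYPES = ("uc-udf", "uc-udtf", "python-run-function")
SCHEMA_ID = "user-defined-operator-v0.1.0"


def find_root_problems(operator):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [f"missing required property: {key}"
                for key in REQUIRED_ROOT if key not in operator]
    if "schema" in operator and operator["schema"] != SCHEMA_ID:
        problems.append(f"schema must be {SCHEMA_ID}")
    if "type" in operator and operator["type"] not in OPERATOR_TYPES:
        problems.append("type must be one of: " + ", ".join(OPERATOR_TYPES))
    for key in ("name", "id"):
        # name and id must be at least 1 character long.
        if key in operator and not operator[key]:
            problems.append(f"{key} must be at least 1 character long")
    return problems


operator = {
    "schema": "user-defined-operator-v0.1.0",
    "type": "python-run-function",
    "name": "My Operator",
    "id": "my_operator",
    "version": "1.0.0",
    "description": "What this operator does.",
    "config": {"type": "object", "properties": {}},
}
print(find_root_problems(operator))
```

Returning a list of problems instead of raising on the first one lets an author fix several issues in a single pass.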
Ports
Ports define how your operator connects to other operators in the pipeline. The ports object contains input and output arrays.
ports:
  input:
    - name: input_data
      title: Input Data
      mime: application/vnd.databricks.dataframe
      allowMultiple: true
      required: true
  output:
    - name: out
      title: Output
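| Property | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Unique identifier for the port. Used in connections and config references. |
| title | string | No | Human-readable label displayed in the UI. |
| mime | string | No | MIME type for the port data. For example, application/vnd.databricks.dataframe. |
| allowMultiple | boolean | No | If true, the port accepts multiple connections. |
| required | boolean | No | If true, the port must be connected. If false, the port is optional. |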
Only the documented port properties are accepted. Unknown keys (such as the legacy label field) are rejected by schema validation.
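To catch a leftover legacy key before schema validation rejects the file, the ports object can be scanned for undocumented keys. This is a rough sketch under two assumptions: the allowed keys are exactly the five port properties documented here, and the ports object is already parsed into a dict. The helper name unknown_port_keys is hypothetical.

```python
# Assumption: the accepted keys are exactly the five documented port properties.
ALLOWED_PORT_KEYS = {"name", "title", "mime", "allowMultiple", "required"}


def unknown_port_keys(ports):
    """Map each port name to its sorted list of undocumented keys."""
    unknown = {}
    for direction in ("input", "output"):
        for port in ports.get(direction, []):
            extra = sorted(set(port) - ALLOWED_PORT_KEYS)
            if extra:
                unknown[port.get("name", "<unnamed>")] = extra
    return unknown


ports = {
    # "label" is the legacy field that schema validation rejects.
    "input": [{"name": "input_data", "label": "Input Data"}],
    "output": [{"name": "out", "title": "Output"}],
}
print(unknown_port_keys(ports))
```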
Port examples
UDF with input and output ports:
ports:
  input:
    - name: in
      title: Input Data
  output:
    - name: out
      title: Output
UDTF with input and output ports:
ports:
  input:
    - name: input_data
      title: Input Data
  output:
    - name: clustered_data
      title: Clustered Results
python-run-function with multiple inputs and an optional port:
ports:
  input:
    - name: main_data
      title: Main Data
    - name: reference_data
      title: Reference Table
      required: false
  output:
    - name: joined_output
      title: Joined Output
Config
The config field is a JSON Schema object. You define each configuration field as a property within the schema. This format gives you access to standard JSON Schema keywords such as enum, minimum, maximum, and examples.
The config object must have type: object and a properties map. You can optionally include required (an array of required property names) and additionalProperties.
config:
  type: object
  properties:
    cluster_count:
      type: number
      title: Number of Clusters
      description: How many clusters to create
      default: 3
      minimum: 1
      maximum: 100
    algorithm:
      type: string
      title: Algorithm
      description: Clustering algorithm to use
      enum: ['kmeans', 'dbscan', 'hierarchical']
      default: kmeans
    feature_col:
      type: string
      title: Feature Column
      description: Column to use as input
      format: expression
      x-ui:
        widget: expression
        port: data
  required: [cluster_count, feature_col]
  additionalProperties: false
Config property fields
Each property in the config.properties object supports the following standard JSON Schema fields:
| Field | Type | Description |
|---|---|---|
| type | string | Data type: string, boolean, number, integer, array, or object. |
| title | string | Human-readable label displayed in the UI. |
| description | string | Help text shown to users. |
| default | any | Default value for the field. |
| examples | array | Example values for the field. |
| enum | array | Fixed list of allowed values. |
| format | string | Semantic type hint. See Format values. |
| minimum | number | Minimum allowed value (for number and integer types). |
| maximum | number | Maximum allowed value (for number and integer types). |
| items | object | Schema for array elements (when type is array). |
| properties | object | Nested property definitions (when type is object). |
| required | array | List of required nested property names (when type is object). |
Other standard JSON Schema fields such as minLength, maxLength, pattern, and const are also supported.
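To illustrate how these fields constrain user input, the following sketch checks a set of config values against a schema shaped like the clustering example on this page. It is a simplified stand-in, not how Lakeflow Designer validates config: it only handles required, additionalProperties, enum, minimum, and maximum, and the helper name check_config is hypothetical.

```python
# Sketch only: covers required, additionalProperties, enum, minimum, and maximum.
def check_config(schema, values):
    """Return a list of problems found in `values` (hypothetical helper)."""
    problems = [f"{name}: required" for name in schema.get("required", [])
                if name not in values]
    for name, value in values.items():
        prop = schema["properties"].get(name)
        if prop is None:
            if schema.get("additionalProperties") is False:
                problems.append(f"{name}: not declared in properties")
            continue
        if "enum" in prop and value not in prop["enum"]:
            problems.append(f"{name}: not in enum")
        # Range checks only apply to numeric values (bool is excluded on purpose).
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            if "minimum" in prop and value < prop["minimum"]:
                problems.append(f"{name}: below minimum {prop['minimum']}")
            if "maximum" in prop and value > prop["maximum"]:
                problems.append(f"{name}: above maximum {prop['maximum']}")
    return problems


schema = {
    "type": "object",
    "properties": {
        "cluster_count": {"type": "number", "minimum": 1, "maximum": 100},
        "algorithm": {"type": "string", "enum": ["kmeans", "dbscan", "hierarchical"]},
        "feature_col": {"type": "string"},
    },
    "required": ["cluster_count", "feature_col"],
    "additionalProperties": False,
}
problems = check_config(schema, {"cluster_count": 500, "algorithm": "svm"})
for problem in problems:
    print(problem)
```

Here the values omit the required feature_col, exceed the maximum for cluster_count, and use an algorithm outside the enum, so the sketch reports one problem for each.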
Format values
The format field on a config property provides a semantic type hint that tells Lakeflow Designer how to interpret the value. These hints enable specialized UI behavior and validation.
| Format | Description |
|---|---|
| expression | Column reference or SQL expression. |
| table_source | Table source reference. |
| file_source | File source reference. |
| column_expressions | Column expressions. |
| sort_expressions | Sort expressions. |
| aggregation_expressions | Aggregation expressions. |
| ai_function_expressions | AI function expressions. |
| is_preview | Automatic preview mode flag, set by Lakeflow Designer. |
| string[] | String array. |
UI widgets
Widgets customize how a config field renders in the Lakeflow Designer interface. Define widgets in the x-ui property on each config property. If you omit the widget, Lakeflow Designer uses a default widget based on the data type.
| Widget | Data type | Description |
|---|---|---|
| input | string | Single-line text input. |
| textarea | string | Multi-line text area. Supports optional rows. |
| checkbox | boolean | Standard checkbox. |
| toggle | boolean | Toggle switch. |
| number | number/integer | Numeric input with optional constraints. |
| slider | number/integer | Visual slider for numeric ranges. Supports optional step. |
| select | string | Single-select dropdown. Requires optionsSource. |
| multi-select | array | Multi-select dropdown. Requires optionsSource. |
| expression | string | Column/expression selector. Requires port. |
input
Single-line text input field.
api_endpoint:
  type: string
  title: API Endpoint
  x-ui:
    widget: input
textarea
Multi-line text area for longer content. Supports an optional rows property to control the height.
message_body:
  type: string
  title: Message Body
  x-ui:
    widget: textarea
    rows: 4
checkbox
Standard checkbox for boolean values.
send_notification:
  type: boolean
  title: Send Notification
  default: false
  x-ui:
    widget: checkbox
toggle
Toggle switch for boolean values.
enable_logging:
  type: boolean
  title: Enable Logging
  default: true
  x-ui:
    widget: toggle
number
Numeric input field. Use minimum and maximum on the property itself to constrain the range.
num_clusters:
  type: number
  title: Number of Clusters
  default: 3
  minimum: 1
  maximum: 100
  x-ui:
    widget: number
slider
Visual slider for selecting numeric values within a range. Use minimum and maximum on the property to set the range, and step in x-ui to control the increment.
confidence_threshold:
  type: number
  title: Confidence Threshold
  default: 0.8
  minimum: 0
  maximum: 1
  x-ui:
    widget: slider
    step: 0.05
select
Single-select dropdown. Requires an optionsSource to define where the dropdown values come from. See Options sources.
aggregation_type:
  type: string
  title: Aggregation Type
  x-ui:
    widget: select
    optionsSource:
      type: static
      values: ['sum', 'avg', 'min', 'max', 'count']
multi-select
Multi-select dropdown for choosing multiple values. Use type: array with items: { type: string } on the property. Requires an optionsSource. See Options sources.
feature_columns:
  type: array
  title: Feature Columns
  items:
    type: string
  x-ui:
    widget: multi-select
    optionsSource:
      type: inputColumns
      port: input_data
expression
Column/expression selector that lets users pick a column from input data or write a custom SQL expression. Set format: expression on the property and specify the input port in x-ui. This is useful:
- When the user should select a column from the input data.
- When the user might want to write a custom SQL expression.
- For parameters that reference dynamic data in the pipeline.
amount:
  type: string
  title: Amount
  format: expression
  x-ui:
    widget: expression
    port: input_data
Options sources
For select and multi-select widgets, you must define where the dropdown options come from using optionsSource.
Static options
A fixed list of values defined in the YAML.
optionsSource:
  type: static
  values: ['option1', 'option2', 'option3']
| Property | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be static. |
| values | array | Yes | Array of string values for the dropdown. |
Input columns
Dynamically populates the dropdown with column names from an input port.
optionsSource:
  type: inputColumns
  port: input_data
| Property | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be inputColumns. |
| port | string | Yes | Name of the input port to get column names from. Must match the name of an input port defined in ports.input. |
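The difference between the two option sources can be sketched as a small resolver. This is an illustration only, not Lakeflow Designer's implementation: the function name resolve_options and the input_columns mapping (input port name to a list of column names) are assumptions made for the example.

```python
def resolve_options(options_source, input_columns):
    """Return the dropdown values for an optionsSource definition.

    input_columns is a hypothetical dict mapping each input port name
    to the list of column names available on that port.
    """
    kind = options_source["type"]
    if kind == "static":
        # Fixed values come straight from the YAML.
        return list(options_source["values"])
    if kind == "inputColumns":
        # Values come from the columns of the referenced input port.
        return list(input_columns[options_source["port"]])
    raise ValueError(f"unsupported optionsSource type: {kind}")


input_columns = {"input_data": ["customer_id", "amount", "region"]}

static_options = resolve_options(
    {"type": "static", "values": ["option1", "option2", "option3"]}, input_columns
)
column_options = resolve_options(
    {"type": "inputColumns", "port": "input_data"}, input_columns
)
print(static_options)
print(column_options)
```

A port value that does not name a declared input port raises a KeyError in this sketch, which mirrors why the port must match an input port name.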
run_function
The run_function property lets you embed Python code directly in the YAML configuration for python-run-function operators. This eliminates the need to register a separate Unity Catalog function.
run_function:
  type: inline
  code: |
    def run(config, inputs, spark):
        df = inputs["data"]
        threshold = config["threshold"]
        return {"out": df.filter(df["score"] > threshold)}
| Property | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be inline. |
| code | string | Yes | Python source code. Must define a run(config, inputs, spark) function. |
The run() function receives three arguments:
- config: A dictionary of configuration values set by the user in the UI.
- inputs: A dictionary mapping input port names to DataFrames.
- spark: The active SparkSession.
The function must return a dictionary mapping output port names to DataFrames. The keys must exactly match the name field of each output port defined in ports.output. For example, with an output port named out:
return {"out": result_df}
With multiple output ports:
return {"match": match_df, "rest": rest_df}
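The return contract can be exercised locally before the operator is deployed. In the sketch below, plain Python lists of dicts stand in for DataFrames and spark is passed as None, so it runs without Spark. Both choices are assumptions made for illustration, and check_output_keys is a hypothetical helper.

```python
def run(config, inputs, spark):
    # Stand-in logic: rows are plain dicts here. A real operator receives
    # DataFrames and would use DataFrame operations instead.
    rows = inputs["data"]
    threshold = config["threshold"]
    match = [row for row in rows if row["score"] > threshold]
    rest = [row for row in rows if row["score"] <= threshold]
    return {"match": match, "rest": rest}


def check_output_keys(result, output_ports):
    """Return port names that are missing from, or unexpected in, the result."""
    expected = {port["name"] for port in output_ports}
    return sorted(expected ^ set(result))


output_ports = [{"name": "match"}, {"name": "rest"}]
result = run(
    {"threshold": 0.5},
    {"data": [{"score": 0.9}, {"score": 0.1}]},
    spark=None,
)
mismatched = check_output_keys(result, output_ports)
print(mismatched)
```

An empty list means every output port has a matching key and no extra keys were returned.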
environment
The environment property specifies the Python environment for python-run-function operators. Use it to pin the environment version and declare pip dependencies.
environment:
  environment_version: '4'
  dependencies:
    - 'scikit-learn>=1.3'
    - 'pandas>=2.0'
| Property | Type | Required | Description |
|---|---|---|---|
| environment_version | string | No | The environment version to use. For example, '4'. |
| dependencies | array of strings | No | List of pip dependency specifiers. Each entry follows standard pip syntax (for example, 'pandas>=2.0'). |
Complete examples
UC-based UDF
This example defines a Unity Catalog-based UDF operator that calculates compound interest.
schema: user-defined-operator-v0.1.0
type: uc-udf
name: Compound Interest
id: finance.compound_interest
version: '1.0.0'
description: >
  Calculates compound interest based on principal, rate, and time period.
config:
  type: object
  properties:
    principal:
      type: string
      title: Principal Amount
      format: expression
      x-ui:
        widget: expression
        port: input_data
    annual_rate:
      type: number
      title: Annual Interest Rate
      default: 5.0
      minimum: 0
      maximum: 100
      x-ui:
        widget: number
    years:
      type: number
      title: Number of Years
      default: 10
      minimum: 1
      maximum: 50
      x-ui:
        widget: slider
        step: 1
    compound_frequency:
      type: string
      title: Compounding Frequency
      default: 'monthly'
      x-ui:
        widget: select
        optionsSource:
          type: static
          values: ['daily', 'monthly', 'quarterly', 'annually']
  required: [principal, annual_rate]
  additionalProperties: false
ports:
  input:
    - name: input_data
      title: Input Data
  output:
    - name: out
      title: Output
Python run-function operator
This example defines a python-run-function operator that segments customers using K-Means clustering.
schema: user-defined-operator-v0.1.0
type: python-run-function
name: Customer Segmentation
id: ml.customer_segmentation
version: '1.2.0'
description: >
  Segments customers into groups based on selected features
  using K-Means clustering. Returns customer IDs with their
  assigned segment numbers.
config:
  type: object
  properties:
    num_segments:
      type: integer
      title: Number of Segments
      description: How many customer segments to create
      default: 3
      minimum: 2
      maximum: 20
      x-ui:
        widget: number
    customer_id_column:
      type: string
      title: Customer ID Column
      description: Column containing customer identifiers
      x-ui:
        widget: select
        optionsSource:
          type: inputColumns
          port: customer_data
    feature_columns:
      type: array
      title: Feature Columns
      description: Columns to use for segmentation
      items:
        type: string
      x-ui:
        widget: multi-select
        optionsSource:
          type: inputColumns
          port: customer_data
    normalize_features:
      type: boolean
      title: Normalize Features
      description: Whether to normalize feature values before clustering
      default: true
      x-ui:
        widget: toggle
  required: [num_segments, customer_id_column, feature_columns]
  additionalProperties: false
ports:
  input:
    - name: customer_data
      title: Customer Data
      mime: application/vnd.databricks.dataframe
  output:
    - name: segmented_customers
      title: Segmented Customers
run_function:
  type: inline
  code: |
    def run(config, inputs, spark):
        from pyspark.ml.feature import VectorAssembler, StandardScaler
        from pyspark.ml.clustering import KMeans

        df = inputs["customer_data"]
        id_col = config["customer_id_column"]
        features = config["feature_columns"]
        k = config["num_segments"]
        normalize = config.get("normalize_features", True)

        assembler = VectorAssembler(inputCols=features, outputCol="features_vec")
        assembled = assembler.transform(df)

        if normalize:
            scaler = StandardScaler(inputCol="features_vec", outputCol="scaled_features")
            model = scaler.fit(assembled)
            assembled = model.transform(assembled)
            feature_col = "scaled_features"
        else:
            feature_col = "features_vec"

        kmeans = KMeans(k=k, featuresCol=feature_col, predictionCol="segment")
        result = kmeans.fit(assembled).transform(assembled)
        return {"segmented_customers": result.select(id_col, "segment")}
environment:
  environment_version: '4'
  dependencies:
    - 'scikit-learn>=1.3'
Quick reference
Required root properties
- schema: user-defined-operator-v0.1.0
- name: Display name
- id: Unique identifier
- description: What the operator does
- config: JSON Schema object
- type: uc-udf, uc-udtf, or python-run-function
- version: Author-defined version string
Optional root properties
- ports: Input and output port definitions
- run_function: Inline Python code (python-run-function only)
- environment: Python environment and dependencies (python-run-function only)
Config property data types
string | boolean | number | integer | array | object
UI widgets
input | textarea | checkbox | toggle | number | slider | select | multi-select | expression
Options sources
static (fixed values) | inputColumns (from input port)
Format values
expression | table_source | file_source | column_expressions | sort_expressions | aggregation_expressions | ai_function_expressions | is_preview | string[]