User-defined operator YAML reference

Preview

This page describes the YAML configuration for user-defined operators in Lakeflow Designer. All operator types (uc-udf, uc-udtf, and python-run-function) use the user-defined-operator-v0.1.0 schema, which defines configuration fields using the JSON Schema format.

For information about how to build user-defined operators, see User-defined operators in Lakeflow Designer.

Root properties

Every operator YAML file starts with a set of root properties that identify the operator and define its behavior. The following example shows the general structure:

YAML
schema: user-defined-operator-v0.1.0
type: python-run-function
name: My Operator
id: my_operator
version: '1.0.0'
description: >
  What this operator does.
  Can be multiple lines.
config:
  type: object
  properties:
    my_field:
      type: string
      title: My Field
      description: Help text
ports:
  input:
    - name: data
      title: Input Data
  output:
    - name: out
      title: Output
run_function:
  type: inline
  code: |
    def run(config, inputs, spark):
        return {"out": inputs["data"]}
environment:
  environment_version: '4'
  dependencies:
    - 'pandas>=2.0'

Property	Type	Required	Description
`schema`	string	Yes	Schema identifier. Must be `user-defined-operator-v0.1.0`.
`type`	string	Yes	Type of operator: `uc-udf`, `uc-udtf`, or `python-run-function`.
`name`	string	Yes	Display name for the operator. Keep it short to fit the Lakeflow Designer UI. Minimum length of 1 character.
`id`	string	Yes	Unique identifier for the operator type. Minimum length of 1 character. Consider using namespaces (such as `finance.` or `ml.`) to categorize operators.
`description`	string	Yes	Detailed description of what the operator does. Shown to users in the UI. Use YAML multi-line syntax (`>`) for longer descriptions.
`config`	object	Yes	JSON Schema object that defines configuration fields. See Config.
`ports`	object	No	Input and output port definitions. See Ports.
`version`	string	Yes	Version string (for example, `"1.0.0"`). Use this to track your own operator releases.
`run_function`	object	No	Inline Python code for `python-run-function` operators. See `run_function`.
`environment`	object	No	Python environment configuration, including dependencies. See `environment`.
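
The table above can be turned into a quick local check before you upload an operator. The following is a minimal sketch, assuming the YAML has already been parsed into a Python dictionary. `check_root_properties` is a hypothetical helper, not part of Lakeflow Designer, and it covers only the rules stated in the table.

```python
# Required root keys, operator types, and schema identifier from the table above.
REQUIRED_ROOT_KEYS = {"schema", "type", "name", "id", "description", "config", "version"}
OPERATOR_TYPES = {"uc-udf", "uc-udtf", "python-run-function"}
SCHEMA_ID = "user-defined-operator-v0.1.0"


def check_root_properties(operator: dict) -> list[str]:
    """Return a list of problems found in the root of a parsed operator definition.

    Hypothetical helper: it mirrors the documented table, not the real validator.
    """
    problems = []
    for key in sorted(REQUIRED_ROOT_KEYS - set(operator)):
        problems.append(f"missing required property: {key}")
    if "schema" in operator and operator["schema"] != SCHEMA_ID:
        problems.append(f"schema must be {SCHEMA_ID}")
    if "type" in operator and operator["type"] not in OPERATOR_TYPES:
        problems.append(f"unknown operator type: {operator['type']}")
    for key in ("name", "id"):
        # The table states a minimum length of 1 character for name and id.
        if key in operator and not (isinstance(operator[key], str) and operator[key]):
            problems.append(f"{key} must be a non-empty string")
    return problems
```

An empty list means the root properties satisfy the documented rules. The real schema validation may enforce more than this sketch does.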

Ports

Ports define how your operator connects to other operators in the pipeline. The ports object contains input and output arrays.

YAML
ports:
  input:
    - name: input_data
      title: Input Data
      mime: application/vnd.databricks.dataframe
      allowMultiple: true
      required: true
  output:
    - name: out
      title: Output

Property	Type	Required	Description
`name`	string	Yes	Unique identifier for the port. Used in connections and config references.
`title`	string	No	Human-readable label displayed in the UI.
`mime`	string	No	MIME type for the port data. For example, `application/vnd.databricks.dataframe`.
`allowMultiple`	boolean	No	If `true`, the port accepts multiple incoming connections.
`required`	boolean	No	If `false`, the port is optional. Default: `true`.

Only the documented port properties are accepted. Unknown keys (such as the legacy label field) are rejected by schema validation.
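
To illustrate the rule about unknown keys, here is a small sketch that checks a parsed ports object against the property table. `check_ports` is a hypothetical helper. It assumes port names must be unique within each direction, which this page does not state explicitly.

```python
# Port properties accepted by the schema, from the table above.
ALLOWED_PORT_KEYS = {"name", "title", "mime", "allowMultiple", "required"}


def check_ports(ports: dict) -> list[str]:
    """Return problems found in a parsed ports object (hypothetical helper)."""
    problems = []
    for direction in ("input", "output"):
        seen = set()
        for index, port in enumerate(ports.get(direction, [])):
            where = f"{direction}[{index}]"
            # Unknown keys, such as the legacy label field, are reported.
            unknown = sorted(set(port) - ALLOWED_PORT_KEYS)
            if unknown:
                problems.append(f"{where}: unknown keys {unknown}")
            name = port.get("name")
            if not name:
                problems.append(f"{where}: name is required")
            elif name in seen:
                # Assumption: names are unique within input or within output.
                problems.append(f"{where}: duplicate port name {name!r}")
            seen.add(name)
    return problems
```

For example, a port that still uses `label` is flagged, while a port with only `name` and `title` passes.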

Port examples

UDF with input and output ports:

YAML
ports:
  input:
    - name: in
      title: Input Data
  output:
    - name: out
      title: Output

UDTF with input and output ports:

YAML
ports:
  input:
    - name: input_data
      title: Input Data
  output:
    - name: clustered_data
      title: Clustered Results

python-run-function with multiple inputs and an optional port:

YAML
ports:
  input:
    - name: main_data
      title: Main Data
    - name: reference_data
      title: Reference Table
      required: false
  output:
    - name: joined_output
      title: Joined Output

Config

The config field is a JSON Schema object. You define each configuration field as a property within the schema. This format gives you access to standard JSON Schema keywords such as enum, minimum, and maximum for validation, and examples for documenting sample values.

The config object must have type: object and a properties map. You can optionally include required (an array of required property names) and additionalProperties.

YAML
config:
  type: object
  properties:
    cluster_count:
      type: number
      title: Number of Clusters
      description: How many clusters to create
      default: 3
      minimum: 1
      maximum: 100
    algorithm:
      type: string
      title: Algorithm
      description: Clustering algorithm to use
      enum: ['kmeans', 'dbscan', 'hierarchical']
      default: kmeans
    feature_col:
      type: string
      title: Feature Column
      description: Column to use as input
      format: expression
      x-ui:
        widget: expression
        port: data
  required: [cluster_count, feature_col]
  additionalProperties: false

Config property fields

Each property in the config.properties object supports the following standard JSON Schema fields:

Field	Type	Description
`type`	string	Data type: `string`, `number`, `integer`, `boolean`, `array`, or `object`.
`title`	string	Human-readable label displayed in the UI.
`description`	string	Help text shown to users.
`default`	any	Default value for the field.
`examples`	array	Example values for the field.
`enum`	array	Fixed list of allowed values.
`format`	string	Semantic type hint. See Format values.
`minimum`	number	Minimum allowed value (for `number` and `integer` types).
`maximum`	number	Maximum allowed value (for `number` and `integer` types).
`items`	object	Schema for array elements (when `type` is `array`).
`properties`	object	Nested property definitions (when `type` is `object`).
`required`	array	List of required nested property names (when `type` is `object`).

Other standard JSON Schema fields such as minLength, maxLength, pattern, and const are also supported.
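
To make the effect of these fields concrete, the sketch below checks a dictionary of user-supplied values against a config schema. It is a deliberately simplified stand-in for a full JSON Schema validator and handles only `type`, `enum`, `minimum`, `maximum`, `required`, and `additionalProperties`. The function names are hypothetical.

```python
# Map JSON Schema type names to Python types.
PYTHON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,  # Simplification: a float such as 3.0 is rejected here.
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check_value(name: str, schema: dict, value) -> list[str]:
    """Check one value against one property schema."""
    expected = schema.get("type")
    # bool is a subclass of int in Python, so exclude it from numeric checks.
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if expected in PYTHON_TYPES:
        ok = isinstance(value, PYTHON_TYPES[expected])
        if expected in ("number", "integer") and not is_number:
            ok = False
        if not ok:
            return [f"{name}: expected {expected}"]
    problems = []
    if "enum" in schema and value not in schema["enum"]:
        problems.append(f"{name}: must be one of {schema['enum']}")
    if is_number and "minimum" in schema and value < schema["minimum"]:
        problems.append(f"{name}: below minimum {schema['minimum']}")
    if is_number and "maximum" in schema and value > schema["maximum"]:
        problems.append(f"{name}: above maximum {schema['maximum']}")
    return problems


def check_config(config_schema: dict, values: dict) -> list[str]:
    """Check all values against the config schema."""
    problems = []
    properties = config_schema.get("properties", {})
    for name in config_schema.get("required", []):
        if name not in values:
            problems.append(f"{name}: required")
    for name, value in values.items():
        if name in properties:
            problems.extend(check_value(name, properties[name], value))
        elif config_schema.get("additionalProperties") is False:
            problems.append(f"{name}: not a declared property")
    return problems
```

With the clustering schema shown earlier, a `cluster_count` of 0 is reported as below the minimum, and an undeclared key is reported because `additionalProperties` is `false`.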

Format values

The format field on a config property provides a semantic type hint that tells Lakeflow Designer how to interpret the value. These hints enable specialized UI behavior and validation.

Format	Description
`expression`	Column reference or SQL expression.
`table_source`	Table source reference.
`file_source`	File source reference.
`column_expressions`	Column expressions.
`sort_expressions`	Sort expressions.
`aggregation_expressions`	Aggregation expressions.
`ai_function_expressions`	AI function expressions.
`is_preview`	Automatic preview mode flag. Lakeflow Designer sets this to `true` during workflow preview. The config property name is arbitrary; only the `format: is_preview` tag matters. Use this to skip side effects like external API calls during preview.
`string[]`	String array.
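
As an illustration of the `is_preview` format, the following sketch shows a `run()` function that skips a side effect during preview. It assumes a config property named `preview_mode` tagged with `format: is_preview`. That name, the `notify_external_service` function, and the `data` and `out` port names are all hypothetical.

```python
# Assumed config declaration, shown here as a comment (the property name is arbitrary):
#   preview_mode:
#     type: boolean
#     format: is_preview

sent_notifications = []  # Stand-in for an external system, for illustration only.


def notify_external_service(message):
    """Hypothetical side effect, such as an external API call."""
    sent_notifications.append(message)


def run(config, inputs, spark):
    df = inputs["data"]
    # Skip the side effect when the preview flag is true.
    if not config.get("preview_mode", False):
        notify_external_service("pipeline step completed")
    return {"out": df}
```

The data still flows through the operator in both modes, so the preview shows realistic output without triggering the external call.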

UI widgets

Widgets customize how a config field renders in the Lakeflow Designer interface. Define widgets in the x-ui property on each config property. If you omit the widget, Lakeflow Designer uses a default widget based on the data type.

Widget	Data type	Description
`input`	string	Single-line text input.
`textarea`	string	Multi-line text area. Supports optional `rows` property.
`checkbox`	boolean	Standard checkbox.
`toggle`	boolean	Toggle switch.
`number`	number/integer	Numeric input with optional constraints.
`slider`	number/integer	Visual slider for numeric ranges. Supports optional `step` property.
`select`	string	Single-select dropdown. Requires `optionsSource`.
`multi-select`	array	Multi-select dropdown. Requires `optionsSource`.
`expression`	string	Column/expression selector. Requires `port`.
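
The widget table can also be expressed as a lookup that checks whether a property's data type and x-ui settings fit the chosen widget. This is a sketch under the assumption that the pairings in the table are strict. This page does not say how Lakeflow Designer handles a mismatch, and `check_widget` is a hypothetical helper.

```python
# Data types each widget applies to, from the table above.
WIDGET_TYPES = {
    "input": {"string"},
    "textarea": {"string"},
    "checkbox": {"boolean"},
    "toggle": {"boolean"},
    "number": {"number", "integer"},
    "slider": {"number", "integer"},
    "select": {"string"},
    "multi-select": {"array"},
    "expression": {"string"},
}

# x-ui keys that certain widgets require, from the table above.
WIDGET_REQUIRES = {
    "select": "optionsSource",
    "multi-select": "optionsSource",
    "expression": "port",
}


def check_widget(name: str, prop: dict) -> list[str]:
    """Return problems with the x-ui widget of one config property."""
    ui = prop.get("x-ui", {})
    widget = ui.get("widget")
    if widget is None:
        return []  # No widget given: the Designer picks a default.
    if widget not in WIDGET_TYPES:
        return [f"{name}: unknown widget {widget!r}"]
    problems = []
    if prop.get("type") not in WIDGET_TYPES[widget]:
        allowed = " or ".join(sorted(WIDGET_TYPES[widget]))
        problems.append(f"{name}: widget {widget!r} expects type {allowed}")
    needed = WIDGET_REQUIRES.get(widget)
    if needed and needed not in ui:
        problems.append(f"{name}: widget {widget!r} requires x-ui.{needed}")
    return problems
```

For instance, a `select` widget without an `optionsSource` is reported, while a `slider` on a `number` property passes.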

`input`

Single-line text input field.

YAML
api_endpoint:
  type: string
  title: API Endpoint
  x-ui:
    widget: input

`textarea`

Multi-line text area for longer content. Supports an optional rows property to control the height.

YAML
message_body:
  type: string
  title: Message Body
  x-ui:
    widget: textarea
    rows: 4

`checkbox`

Standard checkbox for boolean values.

YAML
send_notification:
  type: boolean
  title: Send Notification
  default: false
  x-ui:
    widget: checkbox

`toggle`

Toggle switch for boolean values.

YAML
enable_logging:
  type: boolean
  title: Enable Logging
  default: true
  x-ui:
    widget: toggle

`number`

Numeric input field. Use minimum and maximum on the property itself to constrain the range.

YAML
num_clusters:
  type: number
  title: Number of Clusters
  default: 3
  minimum: 1
  maximum: 100
  x-ui:
    widget: number

`slider`

Visual slider for selecting numeric values within a range. Use minimum and maximum on the property to set the range, and step in x-ui to control the increment.

YAML
confidence_threshold:
  type: number
  title: Confidence Threshold
  default: 0.8
  minimum: 0
  maximum: 1
  x-ui:
    widget: slider
    step: 0.05

`select`

Single-select dropdown. Requires an optionsSource to define where the dropdown values come from. See Options sources.

YAML
aggregation_type:
  type: string
  title: Aggregation Type
  x-ui:
    widget: select
    optionsSource:
      type: static
      values: ['sum', 'avg', 'min', 'max', 'count']

`multi-select`

Multi-select dropdown for choosing multiple values. Use type: array with items: { type: string } on the property. Requires an optionsSource. See Options sources.

YAML
feature_columns:
  type: array
  title: Feature Columns
  items:
    type: string
  x-ui:
    widget: multi-select
    optionsSource:
      type: inputColumns
      port: input_data

`expression`

Column/expression selector that lets users pick a column from input data or write a custom SQL expression. Set format: expression on the property and specify the input port in x-ui. Use this widget when:

The user should select a column from the input data.
The user might want to write a custom SQL expression.
A parameter references dynamic data in the pipeline.

YAML
amount:
  type: string
  title: Amount
  format: expression
  x-ui:
    widget: expression
    port: input_data

Options sources

For select and multi-select widgets, you must define where the dropdown options come from using optionsSource.

Static options

A fixed list of values defined in the YAML.

YAML
optionsSource:
  type: static
  values: ['option1', 'option2', 'option3']

Property	Type	Required	Description
`type`	string	Yes	Must be `static`.
`values`	array	Yes	Array of string values for the dropdown.

Input columns

Dynamically populates the dropdown with column names from an input port.

YAML
optionsSource:
  type: inputColumns
  port: input_data

Property	Type	Required	Description
`type`	string	Yes	Must be `inputColumns`.
`port`	string	Yes	Name of the input port to get column names from. Must match the `name` of one of your defined input ports.
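
Because a mismatched port name is an easy mistake to make, the sketch below collects every port reference in a parsed operator definition and compares it with the defined input ports. It checks both `optionsSource.port` and the `port` key used by the expression widget. The helper names are hypothetical.

```python
def referenced_ports(config_schema: dict):
    """Yield (property name, port name) for every port reference in x-ui."""
    for name, prop in config_schema.get("properties", {}).items():
        ui = prop.get("x-ui", {})
        if "port" in ui:
            # Used by the expression widget.
            yield name, ui["port"]
        source = ui.get("optionsSource", {})
        if source.get("type") == "inputColumns" and "port" in source:
            yield name, source["port"]


def check_port_references(operator: dict) -> list[str]:
    """Report config properties that point at an undefined input port."""
    input_names = {
        port["name"] for port in operator.get("ports", {}).get("input", [])
    }
    return [
        f"{prop}: unknown input port {port!r}"
        for prop, port in referenced_ports(operator["config"])
        if port not in input_names
    ]
```

An empty result means every reference resolves to an input port defined under `ports.input`.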

`run_function`

The run_function property lets you embed Python code directly in the YAML configuration for python-run-function operators. This eliminates the need to register a separate Unity Catalog function.

YAML
run_function:
  type: inline
  code: |
    def run(config, inputs, spark):
        df = inputs["data"]
        threshold = config["threshold"]
        return {"out": df.filter(df["score"] > threshold)}

Property	Type	Required	Description
`type`	string	Yes	Must be `inline`.
`code`	string	Yes	Python source code. Must define a `run()` function.

The run() function receives three arguments:

config: A dictionary of configuration values set by the user in the UI.
inputs: A dictionary mapping input port names to DataFrames.
spark: The active SparkSession.

The function must return a dictionary mapping output port names to DataFrames. The keys must exactly match the name field of each output port defined in ports.output. For example, with an output port named out:

Python
return {"out": result_df}

With multiple output ports:

Python
return {"match": match_df, "rest": rest_df}
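
When testing a `run()` function locally, it can help to wrap the call and verify the returned keys against the output ports. The following is a minimal sketch of such a wrapper. `call_run` is a hypothetical test helper, and this page does not describe how Lakeflow Designer itself invokes `run()` or reports a mismatch.

```python
def call_run(run, config, inputs, spark, output_ports):
    """Call run() and verify that its result matches the declared output ports.

    output_ports is the parsed ports.output list from the operator YAML.
    """
    result = run(config, inputs, spark)
    if not isinstance(result, dict):
        raise TypeError("run() must return a dict of output port name to DataFrame")
    expected = {port["name"] for port in output_ports}
    if set(result) != expected:
        raise KeyError(
            f"run() returned {sorted(result)}, expected {sorted(expected)}"
        )
    return result
```

In a unit test you can pass small DataFrames, or simple stand-in objects, as `inputs` and assert on the returned dictionary.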

`environment`

The environment property specifies the Python environment for python-run-function operators. Use it to pin the environment version and declare pip dependencies.

YAML
environment:
  environment_version: '4'
  dependencies:
    - 'scikit-learn>=1.3'
    - 'pandas>=2.0'

Property	Type	Required	Description
`environment_version`	string	No	The environment version to use. For example, `"4"`.
`dependencies`	array of strings	No	List of pip dependency specifiers. Each entry follows standard pip syntax (for example, `"pandas>=2.0"`).

Complete examples

UC-based UDF

This example defines a Unity Catalog-based UDF operator that calculates compound interest.

YAML
schema: user-defined-operator-v0.1.0
type: uc-udf
name: Compound Interest
id: finance.compound_interest
version: '1.0.0'
description: >
  Calculates compound interest based on principal, rate, and time period.

config:
  type: object
  properties:
    principal:
      type: string
      title: Principal Amount
      format: expression
      x-ui:
        widget: expression
        port: input_data

    annual_rate:
      type: number
      title: Annual Interest Rate
      default: 5.0
      minimum: 0
      maximum: 100
      x-ui:
        widget: number

    years:
      type: number
      title: Number of Years
      default: 10
      minimum: 1
      maximum: 50
      x-ui:
        widget: slider
        step: 1

    compound_frequency:
      type: string
      title: Compounding Frequency
      default: 'monthly'
      x-ui:
        widget: select
        optionsSource:
          type: static
          values: ['daily', 'monthly', 'quarterly', 'annually']
  required: [principal, annual_rate]
  additionalProperties: false

ports:
  input:
    - name: input_data
      title: Input Data
  output:
    - name: out
      title: Output

Python run-function operator

This example defines a python-run-function operator that segments customers using K-Means clustering.

YAML
schema: user-defined-operator-v0.1.0
type: python-run-function
name: Customer Segmentation
id: ml.customer_segmentation
version: '1.2.0'
description: >
  Segments customers into groups based on selected features
  using K-Means clustering. Returns customer IDs with their
  assigned segment numbers.

config:
  type: object
  properties:
    num_segments:
      type: integer
      title: Number of Segments
      description: How many customer segments to create
      default: 3
      minimum: 2
      maximum: 20
      x-ui:
        widget: number
    customer_id_column:
      type: string
      title: Customer ID Column
      description: Column containing customer identifiers
      x-ui:
        widget: select
        optionsSource:
          type: inputColumns
          port: customer_data
    feature_columns:
      type: array
      title: Feature Columns
      description: Columns to use for segmentation
      items:
        type: string
      x-ui:
        widget: multi-select
        optionsSource:
          type: inputColumns
          port: customer_data
    normalize_features:
      type: boolean
      title: Normalize Features
      description: Whether to normalize feature values before clustering
      default: true
      x-ui:
        widget: toggle
  required: [num_segments, customer_id_column, feature_columns]
  additionalProperties: false

ports:
  input:
    - name: customer_data
      title: Customer Data
      mime: application/vnd.databricks.dataframe
  output:
    - name: segmented_customers
      title: Segmented Customers

run_function:
  type: inline
  code: |
    def run(config, inputs, spark):
        from pyspark.ml.feature import VectorAssembler, StandardScaler
        from pyspark.ml.clustering import KMeans

        # Read the input DataFrame and the user-supplied configuration values.
        df = inputs["customer_data"]
        id_col = config["customer_id_column"]
        features = config["feature_columns"]
        k = config["num_segments"]
        normalize = config.get("normalize_features", True)

        # Combine the selected feature columns into a single vector column.
        assembler = VectorAssembler(inputCols=features, outputCol="features_vec")
        assembled = assembler.transform(df)

        # Optionally scale each feature to unit standard deviation before clustering.
        if normalize:
            scaler = StandardScaler(inputCol="features_vec", outputCol="scaled_features")
            model = scaler.fit(assembled)
            assembled = model.transform(assembled)
            feature_col = "scaled_features"
        else:
            feature_col = "features_vec"

        # Fit K-Means and assign each row to a segment.
        kmeans = KMeans(k=k, featuresCol=feature_col, predictionCol="segment")
        result = kmeans.fit(assembled).transform(assembled)

        # Return only the ID and segment columns, keyed by the output port name.
        return {"segmented_customers": result.select(id_col, "segment")}

environment:
  environment_version: '4'
  dependencies:
    - 'scikit-learn>=1.3'

Quick reference

Required root properties

schema: user-defined-operator-v0.1.0
name: Display name
id: Unique identifier
description: What the operator does
config: JSON Schema object
type: uc-udf, uc-udtf, or python-run-function
version: Author-defined version string

Optional root properties

ports: Input and output port definitions
run_function: Inline Python code (python-run-function only)
environment: Python environment and dependencies (python-run-function only)

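Config property data types

string | number | integer | boolean | array | object

UI widgets

input | textarea | checkbox | toggle | number | slider | select | multi-select | expression

Options sources

static (fixed values) | inputColumns (from input port)

Format values

expression | table_source | file_source | column_expressions | sort_expressions | aggregation_expressions | ai_function_expressions | is_preview | string[]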
