Skip to main content

to_avro

Converts a column into binary of Avro format.

If both subject and schemaRegistryAddress are provided, the function converts a column into binary of Schema Registry Avro format. The input data schema must have been registered to the given subject in Schema Registry, or the query fails at runtime.

Syntax

Python
from pyspark.sql.avro.functions import to_avro

to_avro(data, jsonFormatSchema=None, subject=None, schemaRegistryAddress=None, options=None)

Parameters

Parameter

Type

Description

data

pyspark.sql.Column or str

The data column to serialize.

jsonFormatSchema

str, optional

User-specified output Avro schema in JSON string format.

subject

pyspark.sql.Column or str, optional

The subject in Schema Registry that the data belongs to.

schemaRegistryAddress

str, optional

The address (host and port) of the Schema Registry.

options

dict, optional

Options to control how the Avro record is serialized and configuration for the schema registry client.

Returns

pyspark.sql.Column: A new column containing the Avro-encoded binary data.

Examples

Example 1: Converting a string column to Avro binary format

Python
from pyspark.sql.avro.functions import to_avro

data = ['SPADES']
df = spark.createDataFrame(data, "string")
df.select(to_avro(df.value).alias("avro")).show(truncate=False)
Output
+--------------------+
|avro |
+--------------------+
|[00 0C 53 50 41 4...|
+--------------------+

Example 2: Converting a string column to Avro using a custom JSON schema

Python
from pyspark.sql.avro.functions import to_avro

data = ['SPADES']
df = spark.createDataFrame(data, "string")
json_format_schema = '''["null", {"type": "enum", "name": "value",
"symbols": ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]}]'''
df.select(to_avro(df.value, json_format_schema).alias("avro")).show(truncate=False)
Output
+--------+
|avro |
+--------+
|[02 00] |
+--------+