
from_avro

Converts a binary column of Avro format into its corresponding catalyst value. The specified schema must match the read data, otherwise the behavior is undefined: it may fail or return an arbitrary result.

If jsonFormatSchema is not provided but both subject and schemaRegistryAddress are provided, the function converts a binary column of Schema Registry Avro format into its corresponding catalyst value.
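A sketch of the Schema Registry form, given a DataFrame `df` whose binary column `value` was produced by a Schema Registry-aware Avro serializer. The subject name and registry address below are placeholders for your own setup; this call requires a reachable Schema Registry and is not runnable as-is:

```python
from pyspark.sql.avro.functions import from_avro

# No jsonFormatSchema: the schema is fetched from the Schema Registry
# using the given subject. The subject and address are placeholders.
parsed = df.select(
    from_avro(
        df.value,
        subject="t-value",
        schemaRegistryAddress="https://registry.example.com:8081",
    ).alias("value")
)
```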

Syntax

Python
from pyspark.sql.avro.functions import from_avro

from_avro(data, jsonFormatSchema=None, options=None, subject=None, schemaRegistryAddress=None)

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `data` | `pyspark.sql.Column` or `str` | The binary column containing Avro-encoded data. |
| `jsonFormatSchema` | `str`, optional | The Avro schema in JSON string format. |
| `options` | `dict`, optional | Options that control how the Avro record is parsed, plus configuration for the Schema Registry client. |
| `subject` | `str`, optional | The Schema Registry subject that the data belongs to. |
| `schemaRegistryAddress` | `str`, optional | The address (host and port) of the Schema Registry. |

Returns

pyspark.sql.Column: A new column containing the deserialized Avro data as the corresponding catalyst value.

Examples

Example 1: Deserializing an Avro binary column using a JSON schema

Python
from pyspark.sql import Row
from pyspark.sql.avro.functions import from_avro, to_avro

data = [(1, Row(age=2, name='Alice'))]
df = spark.createDataFrame(data, ("key", "value"))
avro_df = df.select(to_avro(df.value).alias("avro"))
json_format_schema = '''{"type":"record","name":"topLevelRecord","fields":
[{"name":"avro","type":[{"type":"record","name":"value",
"namespace":"topLevelRecord","fields":[{"name":"age","type":["long","null"]},
{"name":"name","type":["string","null"]}]},"null"]}]}'''
avro_df.select(from_avro(avro_df.avro, json_format_schema).alias("value")).show(truncate=False)
Output
+------------------+
|value             |
+------------------+
|{{2, Alice}}      |
+------------------+