to_avro
Converts a column into binary of Avro format.
If both subject and schemaRegistryAddress are provided, the function converts a column into binary of Schema Registry Avro format. The input data schema must have been registered to the given subject in Schema Registry, or the query fails at runtime.
Syntax
from pyspark.sql.avro.functions import to_avro
to_avro(data, jsonFormatSchema=None, subject=None, schemaRegistryAddress=None, options=None)
Parameters
Parameter | Type | Description |
|---|---|---|
|
| The data column to serialize. |
| str, optional | User-specified output Avro schema in JSON string format. |
|
| The subject in Schema Registry that the data belongs to. |
| str, optional | The address (host and port) of the Schema Registry. |
| dict, optional | Options to control how the Avro record is serialized and configuration for the schema registry client. |
Returns
pyspark.sql.Column: A new column containing the Avro-encoded binary data.
Examples
Example 1: Converting a string column to Avro binary format
from pyspark.sql.avro.functions import to_avro
data = ['SPADES']
df = spark.createDataFrame(data, "string")
df.select(to_avro(df.value).alias("avro")).show(truncate=False)
+--------------------+
|avro |
+--------------------+
|[00 0C 53 50 41 4...|
+--------------------+
Example 2: Converting a string column to Avro using a custom JSON schema
from pyspark.sql.avro.functions import to_avro
data = ['SPADES']
df = spark.createDataFrame(data, "string")
json_format_schema = '''["null", {"type": "enum", "name": "value",
"symbols": ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]}]'''
df.select(to_avro(df.value, json_format_schema).alias("avro")).show(truncate=False)
+--------+
|avro |
+--------+
|[02 00] |
+--------+