from_xml

Parses a column containing an XML string into a row with the specified schema. Returns null if the string cannot be parsed.

Syntax

Python
from pyspark.sql import functions as sf

sf.from_xml(col, schema, options=None)

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| col | pyspark.sql.Column or str | A column, or the name of a column, containing XML strings. |
| schema | StructType, pyspark.sql.Column or str | A StructType, Column, or Python string literal containing a DDL-formatted string to use when parsing the XML column. |
| options | dict, optional | Options to control parsing. Accepts the same options as the XML data source. |

Returns

pyspark.sql.Column: A new column of complex type parsed from the given XML string.

Examples

Example 1: Parsing XML with a DDL-formatted string schema

Python
import pyspark.sql.functions as sf
data = [(1, '''<p><a>1</a></p>''')]
df = spark.createDataFrame(data, ("key", "value"))
# Define the schema using a DDL-formatted string
schema = "STRUCT<a: BIGINT>"
# Parse the XML column using the DDL-formatted schema
df.select(sf.from_xml(df.value, schema).alias("xml")).collect()
Output
[Row(xml=Row(a=1))]

Example 2: Parsing XML with a StructType schema

Python
import pyspark.sql.functions as sf
from pyspark.sql.types import StructType, LongType
data = [(1, '''<p><a>1</a></p>''')]
df = spark.createDataFrame(data, ("key", "value"))
schema = StructType().add("a", LongType())
df.select(sf.from_xml(df.value, schema)).show()
Output
+---------------+
|from_xml(value)|
+---------------+
|            {1}|
+---------------+

Example 3: Parsing XML with ArrayType in schema

Python
import pyspark.sql.functions as sf
data = [(1, '<p><a>1</a><a>2</a></p>')]
df = spark.createDataFrame(data, ("key", "value"))
# Define the schema with an Array type
schema = "STRUCT<a: ARRAY<BIGINT>>"
# Parse the XML column using the schema with an Array
df.select(sf.from_xml(df.value, schema).alias("xml")).collect()
Output
[Row(xml=Row(a=[1, 2]))]

Example 4: Parsing XML using schema_of_xml

Python
import pyspark.sql.functions as sf
# Sample data with an XML column
data = [(1, '<p><a>1</a><a>2</a></p>')]
df = spark.createDataFrame(data, ("key", "value"))
# Generate the schema from an example XML value
schema = sf.schema_of_xml(sf.lit(data[0][1]))
# Parse the XML column using the generated schema
df.select(sf.from_xml(df.value, schema).alias("xml")).collect()
Output
[Row(xml=Row(a=[1, 2]))]