Read and write XML data using the spark-xml
library
Important
This documentation has been retired and might not be updated. The products, services, or technologies mentioned int his content are not officially endorsed or tested by Databricks.
Native XML file format support is available as a Public Preview. See Read and write XML files.
This article describes how to read and write an XML file as an Apache Spark data source.
Requirements
Create the
spark-xml
library as a Maven library. For the Maven coordinate, specify:Databricks Runtime 7.x and above:
com.databricks:spark-xml_2.12:<release>
See
spark-xml
Releases for the latest version of<release>
.Install the library on a cluster.
Example
The example in this section uses the books XML file.
Retrieve the books XML file:
$ wget https://github.com/databricks/spark-xml/raw/master/src/test/resources/books.xml
Upload the file to DBFS.
Read and write XML data
/*Infer schema*/
CREATE TABLE books
USING xml
OPTIONS (path "dbfs:/books.xml", rowTag "book")
/*Specify column names and types*/
CREATE TABLE books (author string, description string, genre string, _id string, price double, publish_date string, title string)
USING xml
OPTIONS (path "dbfs:/books.xml", rowTag "book")
// Infer schema
import com.databricks.spark.xml._ // Add the DataFrame.read.xml() method
val df = spark.read
.option("rowTag", "book")
.xml("dbfs:/books.xml")
val selectedData = df.select("author", "_id")
selectedData.write
.option("rootTag", "books")
.option("rowTag", "book")
.xml("dbfs:/newbooks.xml")
// Specify schema
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}
val customSchema = StructType(Array(
StructField("_id", StringType, nullable = true),
StructField("author", StringType, nullable = true),
StructField("description", StringType, nullable = true),
StructField("genre", StringType, nullable = true),
StructField("price", DoubleType, nullable = true),
StructField("publish_date", StringType, nullable = true),
StructField("title", StringType, nullable = true)))
val df = spark.read
.option("rowTag", "book")
.schema(customSchema)
.xml("books.xml")
val selectedData = df.select("author", "_id")
selectedData.write
.option("rootTag", "books")
.option("rowTag", "book")
.xml("dbfs:/newbooks.xml")
# Infer schema
library(SparkR)
sparkR.session("local[4]", sparkPackages = c("com.databricks:spark-xml_2.12:<release>"))
df <- read.df("dbfs:/books.xml", source = "xml", rowTag = "book")
# Default `rootTag` and `rowTag`
write.df(df, "dbfs:/newbooks.xml", "xml")
# Specify schema
customSchema <- structType(
structField("_id", "string"),
structField("author", "string"),
structField("description", "string"),
structField("genre", "string"),
structField("price", "double"),
structField("publish_date", "string"),
structField("title", "string"))
df <- read.df("dbfs:/books.xml", source = "xml", schema = customSchema, rowTag = "book")
# In this case, `rootTag` is set to "ROWS" and `rowTag` is set to "ROW".
write.df(df, "dbfs:/newbooks.xml", "xml", "overwrite")
Options
Read
path
: Location of XML files. Accepts standard Hadoop globbing expressions.rowTag
: The row tag to treat as a row. For example, in this XML<books><book><book>...</books>
, the value would bebook
. Default isROW
.samplingRatio
: Sampling ratio for inferring schema (0.0 ~ 1). Default is 1. Possible types areStructType
,ArrayType
,StringType
,LongType
,DoubleType
,BooleanType
,TimestampType
andNullType
, unless you provide a schema.excludeAttribute
: Whether to exclude attributes in elements. Default is false.nullValue
: The value to treat as anull
value. Default is""
.mode
: The mode for dealing with corrupt records. Default isPERMISSIVE
.PERMISSIVE
:When it encounters a corrupted record, sets all fields to
null
and puts the malformed string into a new field configured bycolumnNameOfCorruptRecord
.When it encounters a field of the wrong data type, sets the offending field to
null
.
DROPMALFORMED
: ignores corrupted records.FAILFAST
: throws an exception when it detects corrupted records.
inferSchema
: iftrue
, attempts to infer an appropriate type for each resulting DataFrame column, like a boolean, numeric or date type. Iffalse
, all resulting columns are of string type. Default istrue
.columnNameOfCorruptRecord
: The name of new field where malformed strings are stored. Default is_corrupt_record
.attributePrefix
: The prefix for attributes so that to differentiate attributes and elements. This is the prefix for field names. Default is_
.valueTag
: The tag used for the value when there are attributes in an element that has no child elements. Default is_VALUE
.charset
: Defaults toUTF-8
but can be set to other valid charset names.ignoreSurroundingSpaces
: Whether or not whitespaces surrounding values should be skipped. Default is false.rowValidationXSDPath
: Path to an XSD file that is used to validate the XML for each row. Rows that fail to validate are treated like parse errors as above. The XSD does not otherwise affect the schema provided or inferred. If the same local path is not already also visible on the executors in the cluster, then the XSD and any others it depends on should be added to the Spark executors with SparkContext.addFile:Unit). In this case, to use local XSD/foo/bar.xsd
, calladdFile("/foo/bar.xsd")
and pass"bar.xsd"
asrowValidationXSDPath
.
Write
path
: Location to write files.rowTag
: The row tag to treat as a row. For example, in this XML<books><book><book>...</books>
, the value would bebook
. Default isROW
.rootTag
: The root tag to treat as the root. For example, in this XML<books><book><book>...</books>
, the value would bebooks
. Default isROWS
.nullValue
: The value to writenull
value. Default is the string"null"
. When"null"
, it does not write attributes and elements for fields.attributePrefix
: The prefix for attributes to differentiate attributes and elements. This is the prefix for field names. Default is_
.valueTag
: The tag used for the value when there are attributes in an element that has no child elements. Default is_VALUE
.compression
: Compression codec to use when saving to file. Should be the fully qualified name of a class implementingorg.apache.hadoop.io.compress.CompressionCodec
or one of case-insensitive short names (bzip2
,gzip
,lz4
, andsnappy
). Default is no compression.
Supports the shortened name usage; You can use xml
instead of com.databricks.spark.xml
.
XSD support
You can validate individual rows against an XSD schema using rowValidationXSDPath
.
You use the utility com.databricks.spark.xml.util.XSDToSchema
to extract a Spark DataFrame
schema from some XSD files. It supports only simple, complex and sequence types, only basic XSD functionality,
and is experimental.
import com.databricks.spark.xml.util.XSDToSchema
import java.nio.file.Paths
val schema = XSDToSchema.read(Paths.get("/path/to/your.xsd"))
val df = spark.read.schema(schema)....xml(...)
Parse nested XML
Although primarily used to convert an XML file into a DataFrame, you can also use the from_xml
method to parse XML in a string-valued column in an existing DataFrame and add it as a new column with parsed results as a struct with:
import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import spark.implicits._
val df = ... /// DataFrame with XML in column 'payload'
val payloadSchema = schema_of_xml(df.select("payload").as[String])
val parsed = df.withColumn("parsed", from_xml($"payload", payloadSchema))
Note
mode
:If set to
PERMISSIVE
, the default, the parse mode instead defaults toDROPMALFORMED
. If you include a column in the schema forfrom_xml
that matches thecolumnNameOfCorruptRecord
, thenPERMISSIVE
mode outputs malformed records to that column in the resulting struct.If set to
DROPMALFORMED
, XML values that do not parse correctly result in anull
value for the column. No rows are dropped.
from_xml
converts arrays of strings containing XML to arrays of parsed structs. Useschema_of_xml_array
instead.from_xml_string
is an alternative for use in UDFs that operates on a String directly instead of a column.
Conversion rules
Due to structural differences between DataFrames and XML, there are some conversion rules from XML data to DataFrame and from DataFrame to XML data. You can disable handling attributes with the option excludeAttribute
.
Convert XML to DataFrame
Attributes: Attributes are converted as fields with the prefix specified in the
attributePrefix
option. IfattributePrefix
is_
, the document<one myOneAttrib="AAAA"> <two>two</two> <three>three</three> </one>
produces the schema:
root |-- _myOneAttrib: string (nullable = true) |-- two: string (nullable = true) |-- three: string (nullable = true)
If an element has attributes but no child elements, the attribute value is put in a separate field specified in the
valueTag
option. IfvalueTag
is_VALUE
, the document<one> <two myTwoAttrib="BBBBB">two</two> <three>three</three> </one>
produces the schema:
root |-- two: struct (nullable = true) | |-- _VALUE: string (nullable = true) | |-- _myTwoAttrib: string (nullable = true) |-- three: string (nullable = true)
Convert DataFrame to XML
Writing an XML file from DataFrame having a field ArrayType
with its element as ArrayType
would have an additional nested field for the element. This would not happen in reading and writing XML data but in writing a DataFrame read from other sources. Therefore, roundtrip in reading and writing XML files has the same structure but writing a DataFrame read from other sources is possible to have a different structure.
A DataFrame with the schema:
|-- a: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
and data:
+------------------------------------+
| a|
+------------------------------------+
|[WrappedArray(aa), WrappedArray(bb)]|
+------------------------------------+
produces the XML file:
<a>
<item>aa</item>
</a>
<a>
<item>bb</item>
</a>