XML file

This article describes how to read and write an XML file as an Apache Spark data source.

Requirements

  1. Create the spark-xml library as a Maven library. For the Maven coordinate, specify:

    • Databricks Runtime 7.x and above: com.databricks:spark-xml_2.12:<release>
    • Databricks Runtime 5.5 LTS and 6.x: com.databricks:spark-xml_2.11:<release>

    See spark-xml Releases for the latest version of <release>.

  2. Install the library on a cluster.

Example

The example in this section uses the books XML file.

  1. Retrieve the books XML file:

    $ wget https://github.com/databricks/spark-xml/raw/master/src/test/resources/books.xml
    
  2. Upload the file to DBFS.

Read and write XML data

/*Infer schema*/

CREATE TABLE books
USING xml
OPTIONS (path "dbfs:/books.xml", rowTag "book")

/*Specify column names and types*/

CREATE TABLE books (author string, description string, genre string, _id string, price double, publish_date string, title string)
USING xml
OPTIONS (path "dbfs:/books.xml", rowTag "book")
// Infer schema

import com.databricks.spark.xml._ // Add the DataFrame.read.xml() method

val df = spark.read
  .option("rowTag", "book")
  .xml("dbfs:/books.xml")

val selectedData = df.select("author", "_id")
selectedData.write
  .option("rootTag", "books")
  .option("rowTag", "book")
  .xml("dbfs:/newbooks.xml")

// Specify schema

import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

val customSchema = StructType(Array(
  StructField("_id", StringType, nullable = true),
  StructField("author", StringType, nullable = true),
  StructField("description", StringType, nullable = true),
  StructField("genre", StringType, nullable = true),
  StructField("price", DoubleType, nullable = true),
  StructField("publish_date", StringType, nullable = true),
  StructField("title", StringType, nullable = true)))

val df = spark.read
  .option("rowTag", "book")
  .schema(customSchema)
  .xml("books.xml")

val selectedData = df.select("author", "_id")
selectedData.write
  .option("rootTag", "books")
  .option("rowTag", "book")
  .xml("dbfs:/newbooks.xml")
# Infer schema

library(SparkR)

sparkR.session("local[4]", sparkPackages = c("com.databricks:spark-xml_2.12:<release>"))

df <- read.df("dbfs:/books.xml", source = "xml", rowTag = "book")

# Default `rootTag` and `rowTag`
write.df(df, "dbfs:/newbooks.xml", "xml")

# Specify schema

customSchema <- structType(
  structField("_id", "string"),
  structField("author", "string"),
  structField("description", "string"),
  structField("genre", "string"),
  structField("price", "double"),
  structField("publish_date", "string"),
  structField("title", "string"))

df <- read.df("dbfs:/books.xml", source = "xml", schema = customSchema, rowTag = "book")

# In this case, `rootTag` is set to "ROWS" and `rowTag` is set to "ROW".
write.df(df, "dbfs:/newbooks.xml", "xml", "overwrite")

Options

  • Read
    • path: Location of XML files. Accepts standard Hadoop globbing expressions.
    • rowTag: The row tag to treat as a row. For example, in this XML <books><book><book>...</books>, the value would be book. Default is ROW.
    • samplingRatio: Sampling ratio for inferring schema (0.0 ~ 1). Default is 1. Possible types are StructType, ArrayType, StringType, LongType, DoubleType, BooleanType, TimestampType and NullType, unless you provide a schema.
    • excludeAttribute: Whether to exclude attributes in elements. Default is false.
    • nullValue: The value to treat as a null value. Default is "".
    • mode: The mode for dealing with corrupt records. Default is PERMISSIVE.
      • PERMISSIVE:
        • When it encounters a corrupted record, sets all fields to null and puts the malformed string into a new field configured by columnNameOfCorruptRecord.
        • When it encounters a field of the wrong data type, sets the offending field to null.
      • DROPMALFORMED: ignores corrupted records.
      • FAILFAST: throws an exception when it detects corrupted records.
    • inferSchema: if true, attempts to infer an appropriate type for each resulting DataFrame column, like a boolean, numeric or date type. If false, all resulting columns are of string type. Default is true.
    • columnNameOfCorruptRecord: The name of new field where malformed strings are stored. Default is _corrupt_record.
    • attributePrefix: The prefix for attributes so that to differentiate attributes and elements. This is the prefix for field names. Default is _.
    • valueTag: The tag used for the value when there are attributes in an element that has no child elements. Default is _VALUE.
    • charset: Defaults to UTF-8 but can be set to other valid charset names.
    • ignoreSurroundingSpaces: Whether or not whitespaces surrounding values should be skipped. Default is false.
    • rowValidationXSDPath: Path to an XSD file that is used to validate the XML for each row. Rows that fail to validate are treated like parse errors as above. The XSD does not otherwise affect the schema provided or inferred. If the same local path is not already also visible on the executors in the cluster, then the XSD and any others it depends on should be added to the Spark executors with SparkContext.addFile. In this case, to use local XSD /foo/bar.xsd, call addFile("/foo/bar.xsd") and pass "bar.xsd" as rowValidationXSDPath.
  • Write
    • path: Location to write files.
    • rowTag: The row tag to treat as a row. For example, in this XML <books><book><book>...</books>, the value would be book. Default is ROW.
    • rootTag: The root tag to treat as the root. For example, in this XML <books><book><book>...</books>, the value would be books. Default is ROWS.
    • nullValue: The value to write null value. Default is the string "null". When "null", it does not write attributes and elements for fields.
    • attributePrefix: The prefix for attributes to differentiate attributes and elements. This is the prefix for field names. Default is _.
    • valueTag: The tag used for the value when there are attributes in an element that has no child elements. Default is _VALUE.
    • compression: Compression codec to use when saving to file. Should be the fully qualified name of a class implementing org.apache.hadoop.io.compress.CompressionCodec or one of case-insensitive short names (bzip2, gzip, lz4, and snappy). Default is no compression.

Supports the shortened name usage; You can use xml instead of com.databricks.spark.xml.

XSD support

You can validate individual rows against an XSD schema using rowValidationXSDPath.

You use the utility com.databricks.spark.xml.util.XSDToSchema to extract a Spark DataFrame schema from some XSD files. It supports only simple, complex and sequence types, only basic XSD functionality, and is experimental.

import com.databricks.spark.xml.util.XSDToSchema
import java.nio.file.Paths

val schema = XSDToSchema.read(Paths.get("/path/to/your.xsd"))
val df = spark.read.schema(schema)....xml(...)

Parse nested XML

Although primarily used to convert an XML file into a DataFrame, you can also use the from_xml method to parse XML in a string-valued column in an existing DataFrame and add it as a new column with parsed results as a struct with:

import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import spark.implicits._

val df = ... /// DataFrame with XML in column 'payload'
val payloadSchema = schema_of_xml(df.select("payload").as[String])
val parsed = df.withColumn("parsed", from_xml($"payload", payloadSchema))

Note

  • mode:
    • If set to PERMISSIVE, the default, the parse mode instead defaults to DROPMALFORMED. If you include a column in the schema for from_xml that matches the columnNameOfCorruptRecord, then PERMISSIVE mode outputs malformed records to that column in the resulting struct.
    • If set to DROPMALFORMED, XML values that do not parse correctly result in a null value for the column. No rows are dropped.
  • from_xml converts arrays of strings containing XML to arrays of parsed structs. Use schema_of_xml_array instead.
  • from_xml_string is an alternative for use in UDFs that operates on a String directly instead of a column.

Conversion rules

Due to structural differences between DataFrames and XML, there are some conversion rules from XML data to DataFrame and from DataFrame to XML data. You can disable handling attributes with the option excludeAttribute.

Convert XML to DataFrame

  • Attributes: Attributes are converted as fields with the prefix specified in the attributePrefix option. If attributePrefix is _, the document

    <one myOneAttrib="AAAA">
        <two>two</two>
        <three>three</three>
    </one>
    

    produces the schema:

    root
     |-- _myOneAttrib: string (nullable = true)
     |-- two: string (nullable = true)
     |-- three: string (nullable = true)
    
  • If an element has attributes but no child elements, the attribute value is put in a separate field specified in the valueTag option. If valueTag is _VALUE, the document

    <one>
        <two myTwoAttrib="BBBBB">two</two>
        <three>three</three>
    </one>
    

    produces the schema:

    root
     |-- two: struct (nullable = true)
     |    |-- _VALUE: string (nullable = true)
     |    |-- _myTwoAttrib: string (nullable = true)
     |-- three: string (nullable = true)
    

Convert DataFrame to XML

Writing a XML file from DataFrame having a field ArrayType with its element as ArrayType would have an additional nested field for the element. This would not happen in reading and writing XML data but writing a DataFrame read from other sources. Therefore, roundtrip in reading and writing XML files has the same structure but writing a DataFrame read from other sources is possible to have a different structure.

A DataFrame with the schema:

 |-- a: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)

and data:

+------------------------------------+
|                                   a|
+------------------------------------+
|[WrappedArray(aa), WrappedArray(bb)]|
+------------------------------------+

produces the XML file:

<a>
  <item>aa</item>
</a>
<a>
  <item>bb</item>
</a>