from_xml function

Applies to: check marked yes Databricks SQL check marked yes Databricks Runtime 14.1 and above

Preview

This feature is in Private Preview. To try it, reach out to your Databricks contact.

Returns a struct value parsed from the xmlStr using schema.

Syntax

from_xml(xmlStr, schema [, options])

Arguments

  • xmlStr: A STRING expression specifying an XML document.

  • schema: A STRING expression or invocation of schema_of_xml function.

  • options: An optional MAP<STRING,STRING> literal specifying directives.

Returns

A STRUCT with field names and types matching the schema definition.

xmlStr should be well-formed with respect to schema and options. If xmlStr cannot be parsed NULL is returned.

schema must be defined as comma-separated column name and data type pairs as used in for example CREATE TABLE.

options, if provided, can be any of the following:

  • 'excludeAttribute' (default false): Whether to exclude attributes in elements.

  • mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.

    • PERMISSIVE: when it meets a corrupted record, puts the malformed string into a field configured by columnNameOfCorruptRecord, and sets malformed fields to null. To keep corrupt records, you can set a string type field named columnNameOfCorruptRecord in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a columnNameOfCorruptRecord field in an output schema.

    • FAILFAST: throws an exception when it meets corrupted records.

  • columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord): allows renaming the new field having malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord.

  • 'inferSchema (default true): if true, attempts to infer an appropriate type for each resulting attribute, like a boolean, numeric or date type. If false, all resulting columns are of string type.

  • 'attributePrefix' (default '_'): The prefix for attributes to differentiate attributes from elements. This will be the prefix for field names. Can be an empty string.

  • 'valueTag' (default '_VALUE'): The tag used for the value when there are attributes in the element having no child.

  • 'encoding' (default UTF-8): decodes the XML files by the specified encoding type.

  • 'ignoreSurroundingSpaces' (default false): Defines whether surrounding whitespaces from values being read should be skipped.

  • 'rowValidationXSDPath': Path to an XSD file that is used to validate the XML for each row individually. Rows that fail to validate are treated like parse errors as above. The XSD does not otherwise affect the schema provided, or inferred. Note that if the same local path is not already also visible on the executors in the cluster, then the XSD and any others it depends on should be added to the Spark executors with SparkContext.addFile. In this case, to use local XSD /foo/bar.xsd, call` addFile(‘/foo/bar.xsd’) and pass just 'bar.xsd' as rowValidationXSDPath`.

  • 'ignoreNamespace' (default false): If true, namespaces prefixes on XML elements and attributes are ignored. Tags <abc:author> and <def:author> would, for example, be treated as if both are just <author>. Note that namespaces cannot be ignored on the rowTag element, only its children. Note that XML parsing is in general not namespace-aware even if false.

  • 'timestampFormat' (default yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]): sets the string that indicates a timestamp format. Custom date formats follow the formats at Datetime patterns. This applies to TIMESTAMP.

  • 'dateFormat' (default yyyy-MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at Datetime patterns. This applies to DATE.

  • locale (default is en-US): sets a locale as language tag in IETF BCP 47 format. For instance, this is used while parsing dates and timestamps.

  • nullValue (default is 'null'): Sets the string representation of a null value.

Examples

> SELECT from_xml('<p><a>1</a><b>0.8</b></p>', 'a INT, b DOUBLE');
 {"a":1,"b":0.8}

> SELECT from_xml('<p><time>26/08/2015</time></p>', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));
 { "time": "2015-08-26T00:00:00.000+0000"}

> SELECT from_xml('<p><teacher>Alice</teacher><student><name>Bob</name><rank>1</rank></student><student><name>Charlie</name><rank>2</rank></student></p>',
                  'STRUCT<teacher: STRING, student: ARRAY<STRUCT<name: STRING, rank: INT>>>');
 {"teacher":"Alice","student":[{"name":"Bob","rank":1},{"name":"Charlie","rank":2}]}