Query variant data
This feature is in Public Preview.
This article describes how you can query and transform semi-structured data stored as VARIANT. The VARIANT data type is available in Databricks Runtime 15.3 and above.
Databricks recommends using VARIANT over JSON strings. For users currently using JSON strings looking to migrate, see How is variant different than JSON strings?.
If you want to see examples for querying semi-structured data stored with JSON strings, see Query JSON strings.
VARIANT columns cannot be used for clustering keys, partitions, or Z-order keys. The VARIANT data type cannot be used for comparisons, grouping, ordering, and set operations. For a full list of limitations, see Limitations.
Create a table with a variant column
To create a variant column, use the parse_json function (SQL or Python).
Run the following to create a table with highly nested data stored as VARIANT. (This data is used in other examples on this page.)
- SQL
- Python
-- Create a table with a variant column
CREATE TABLE store_data AS
SELECT parse_json(
'{
"store":{
"fruit": [
{"weight":8,"type":"apple"},
{"weight":9,"type":"pear"}
],
"basket":[
[1,2,{"b":"y","a":"x"}],
[3,4],
[5,6]
],
"book":[
{
"author":"Nigel Rees",
"title":"Sayings of the Century",
"category":"reference",
"price":8.95
},
{
"author":"Herman Melville",
"title":"Moby Dick",
"category":"fiction",
"price":8.99,
"isbn":"0-553-21311-3"
},
{
"author":"J. R. R. Tolkien",
"title":"The Lord of the Rings",
"category":"fiction",
"reader":[
{"age":25,"name":"bob"},
{"age":26,"name":"jack"}
],
"price":22.99,
"isbn":"0-395-19395-8"
}
],
"bicycle":{
"price":19.95,
"color":"red"
}
},
"owner":"amy",
"zip code":"94025",
"fb:testid":"1234"
}'
) as raw
SELECT * FROM store_data
# Create a table with a variant column
store_data='''
{
"store":{
"fruit":[
{"weight":8,"type":"apple"},
{"weight":9,"type":"pear"}
],
"basket":[
[1,2,{"b":"y","a":"x"}],
[3,4],
[5,6]
],
"book":[
{
"author":"Nigel Rees",
"title":"Sayings of the Century",
"category":"reference",
"price":8.95
},
{
"author":"Herman Melville",
"title":"Moby Dick",
"category":"fiction",
"price":8.99,
"isbn":"0-553-21311-3"
},
{
"author":"J. R. R. Tolkien",
"title":"The Lord of the Rings",
"category":"fiction",
"reader":[
{"age":25,"name":"bob"},
{"age":26,"name":"jack"}
],
"price":22.99,
"isbn":"0-395-19395-8"
}
],
"bicycle":{
"price":19.95,
"color":"red"
}
},
"owner":"amy",
"zip code":"94025",
"fb:testid":"1234"
}
'''
# Create a DataFrame
df = spark.createDataFrame([(store_data,)], ["json"])
# Convert to a variant
df_variant = df.select(parse_json(col("json")).alias("raw"))
# Alternatively, create the DataFrame directly
# df_variant = spark.range(1).select(parse_json(lit(store_data)))
df_variant.display()
# Write out as a table
df_variant.write.saveAsTable("store_data")
Query fields in a variant column
To extract fields from a variant column, use the variant_get function (SQL or Python) specifying the name of the JSON field in your extraction path. Field names are always case sensitive.
- SQL
- Python
-- Extract a top-level field
SELECT variant_get(store_data.raw, '$.owner') AS owner FROM store_data
You can also use SQL syntax to query fields in a variant column. See SQL shorthand for variant_get.
# Extract a top-level field
df_variant.select(variant_get(col("raw"), "$.owner", "string")).display()
SQL shorthand for variant_get
The SQL syntax for querying JSON strings and other complex data types on Databricks applies to VARIANT data, including the following:
- Use
:to select top level fields. - Use
.or[<key>]to select nested fields with named keys. - Use
[<index>]to select values from arrays.
SELECT raw:owner FROM store_data
+-------+
| owner |
+-------+
| "amy" |
+-------+
-- Use backticks to escape special characters.
SELECT raw:`zip code`, raw:`fb:testid` FROM store_data
+----------+-----------+
| zip code | fb:testid |
+----------+-----------+
| "94025" | "1234" |
+----------+-----------+
If a field name contains a period (.), you must escape it with square brackets ([ ]). For example, the following query selects a field named zip.code:
SELECT raw:['zip.code'] FROM store_data
Extract variant nested fields
To extract nested fields from a variant column, specify them using dot notation or brackets. Field names are always case sensitive.
- SQL
- Python
-- Use dot notation
SELECT raw:store.bicycle FROM store_data
-- Use brackets
SELECT raw:store['bicycle'] FROM store_data
If a path cannot be found, the result is NULL of type VARIANT.
# Use dot notation
df_variant.select(variant_get(col("raw"), "$.store.bicycle", "string")).display()
# Use brackets
df_variant.select(variant_get(col("raw"), "$.store['bicycle']", "string")).display()
If a path cannot be found, the result is null of type VariantVal.
+-----------------+
| bicycle |
+-----------------+
| { |
| "color":"red", |
| "price":19.95 |
| } |
+-----------------+
Extract values from variant arrays
To extract elements from arrays, index with brackets. Indices are 0-based.
- SQL
- Python
-- Index elements
SELECT raw:store.fruit[0], raw:store.fruit[1] FROM store_data
# Index elements
df_variant.select((variant_get(col("raw"), "$.store.fruit[0]", "string")),(variant_get(col("raw"), "$.store.fruit[1]", "string"))).display()
+-------------------+------------------+
| fruit | fruit |
+-------------------+------------------+
| { | { |
| "type":"apple", | "type":"pear", |
| "weight":8 | "weight":9 |
| } | } |
+-------------------+------------------+
If the path cannot be found, or if the array-index is out of bounds, the result is null.
Working with variants in Python
You can extract variants from Spark DataFrames into Python as VariantVal and work with them individually using the toPython and toJson methods.
# toPython
data = [
('{"name": "Alice", "age": 25}',),
('["person", "electronic"]',),
('1',)
]
df_person = spark.createDataFrame(data, ["json"])
# Collect variants into a VariantVal
variants = df_person.select(parse_json(col("json")).alias("v")).collect()
Output the VariantVal as a JSON string:
print(variants[0].v.toJson())
{"age":25,"name":"Alice"}
Convert a VariantVal to a Python object:
# First element is a dictionary
print(variants[0].v.toPython()["age"])
25
# Second element is a List
print(variants[1].v.toPython()[1])
electronic
# Third element is an Integer
print(variants[2].v.toPython())
1
You can also construct VariantVal using the VariantVal.parseJson function.
# parseJson to construct VariantVal's in Python
from pyspark.sql.types import VariantVal
variant = VariantVal.parseJson('{"a": 1}')
Print the variant as a JSON string:
print(variant.toJson())
{"a":1}
Convert the variant to a Python object and print a value:
print(variant.toPython()["a"])
1
Return the schema of a variant
To return the schema of a variant, use the schema_of_variant function (SQL or Python).
- SQL
- Python
-- Return the schema of the variant
SELECT schema_of_variant(raw) FROM store_data;
# Return the schema of the variant
df_variant.select(schema_of_variant(col("raw"))).display()
To return the combined schemas of all variants in a group, use the schema_of_variant_agg function (SQL or Python).
The following examples return the schema and then the combined schema for the example data json_data.
- SQL
- Python
CREATE OR REPLACE TEMP VIEW json_data AS
SELECT '{"name": "Alice", "age": 25}' AS json UNION ALL
SELECT '{"id": 101, "department": "HR"}' UNION ALL
SELECT '{"product": "Laptop", "price": 1200.50, "in_stock": true}';
-- Return the schema
SELECT schema_of_variant(parse_json(json)) FROM json_data;
json_data = [
('{"name": "Alice", "age": 25}',),
('{"id": 101, "department": "HR"}',),
('{"product": "Laptop", "price": 1200.50, "in_stock": true}',)
]
df_item = spark.createDataFrame(json_data, ["json"])
# Return the schema
df_item.select(parse_json(col("json")).alias("v")).select(schema_of_variant(col("v"))).display()
+-----------------------------------------------------------------+
| schema_of_variant(v) |
+-----------------------------------------------------------------+
| OBJECT<age: BIGINT, name: STRING> |
| OBJECT<department: STRING, id: BIGINT> |
| OBJECT<in_stock: BOOLEAN, price: DECIMAL(5,1), product: STRING> |
+-----------------------------------------------------------------+
- SQL
- Python
-- Return the combined schema
SELECT schema_of_variant_agg(parse_json(json)) FROM json_data;
# Return the combined schema
df.select(parse_json(col("json")).alias("v")).select(schema_of_variant_agg(col("v"))).display()
+----------------------------------------------------------------------------------------------------------------------------+
| schema_of_variant(v) |
+----------------------------------------------------------------------------------------------------------------------------+
| OBJECT<age: BIGINT, department: STRING, id: BIGINT, in_stock: BOOLEAN, name: STRING, price: DECIMAL(5,1), product: STRING> |
+----------------------------------------------------------------------------------------------------------------------------+
Flatten variant objects and arrays
The variant_explode table-valued generator function (SQL or Python) can be used to flatten variant arrays and objects.
- SQL
- Python
Because variant_explode is a generator function, you use it as part of the FROM clause rather than in the SELECT list, as in the following examples:
SELECT key, value
FROM store_data,
LATERAL variant_explode(store_data.raw);
SELECT pos, value
FROM store_data,
LATERAL variant_explode(store_data.raw:store.basket[0]);
Use the table-valued function (TVF) DataFrame API to expand the variant into multiple rows:
spark.tvf.variant_explode(parse_json(lit(store_data))).display()
Variant type casting rules
You can store arrays and scalars using VARIANT type. When trying to cast variant types to other types, normal casting rules apply for individual values and fields, with the following additional rules.
variant_get and try_variant_get take type arguments and follow these casting rules.
Source type | Behavior |
|---|---|
| The result is a |
| The |
When inferring type with schema_of_variant or schema_of_variant_agg, functions fall back to VARIANT type rather than STRING type when conflicting types are present that can't be resolved.
- SQL
- Python
Use the try_variant_get function (SQL) to cast:
-- price is returned as a double, not a string
SELECT try_variant_get(raw, '$.store.bicycle.price', 'double') as price FROM store_data
+------------------+
| price |
+------------------+
| 19.95 |
+------------------+
You can also use :: or cast to cast values to supported data types:
-- cast into more complex types
SELECT cast(raw:store.bicycle AS STRUCT<price DOUBLE, color STRING>) bicycle FROM store_data;
-- `::` also supported
SELECT raw:store.bicycle::STRUCT<price DOUBLE, color STRING> bicycle FROM store_data;
+------------------+
| bicycle |
+------------------+
| { |
| "price":19.95, |
| "color":"red" |
| } |
+------------------+
Use the try_variant_get function (Python) to cast:
# price is returned as a double, not a string
df_variant.select(try_variant_get(col("raw"), "$.store.bicycle.price", "double").alias("price"))
+------------------+
| price |
+------------------+
| 19.95 |
+------------------+
Also use the try_variant_get function (SQL or Python) to handle cast failures:
- SQL
- Python
SELECT try_variant_get(
parse_json('{"a" : "c", "b" : 2}'),
'$.a',
'boolean'
)
spark.range(1).select(parse_json(lit('{"a" : "c", "b" : 2}')).alias("v")).select(try_variant_get(col('v'), '$.a', 'boolean')).display()
Variant null rules
Use the is_variant_null function (SQL or Python) to determine if the variant value is a variant null.
- SQL
- Python
Variants can contain two kinds of nulls:
- SQL
NULL: SQLNULLs indicate that the value is missing. These are the sameNULLs as when dealing with structured data. - Variant
NULL: VariantNULLs indicate that the variant explicitly contains aNULLvalue. These are not the same as SQLNULLs, because theNULLvalue is stored in the data.
SELECT
is_variant_null(parse_json(NULL)) AS sql_null,
is_variant_null(parse_json('null')) AS variant_null,
is_variant_null(parse_json('{ "field_a": null }'):field_a) AS variant_null_value,
is_variant_null(parse_json('{ "field_a": null }'):missing) AS missing_sql_value_null
+--------+------------+------------------+----------------------+
|sql_null|variant_null|variant_null_value|missing_sql_value_null|
+--------+------------+------------------+----------------------+
| false| true| true| false|
+--------+------------+------------------+----------------------+
data = [
('null',),
(None,),
('{"field_a" : 1, "field_b" : 2}',)
]
df = spark.createDataFrame(data, ["null_data"])
df.select(parse_json(col("null_data")).alias("v")).select(is_variant_null(col("v"))).display()
+------------------+
|is_variant_null(v)|
+------------------+
| true|
+------------------+
| false|
+------------------+
| false|
+------------------+