Incompatible Schema in Some Files

Problem

The Spark job fails with an exception like the following while reading Parquet files:

Error in SQL statement: SparkException: Job aborted due to stage failure:
Task 20 in stage 11227.0 failed 4 times, most recent failure: Lost task 20.3 in stage 11227.0
(TID 868031, 10.111.245.219, executor 31):
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary
    at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:52)

Cause

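In this instance, the java.lang.UnsupportedOperationException is thrown because one or more Parquet files in the folder were written with a schema that is incompatible with the rest. In the trace above, Spark expected a long column (decodeToLong), but the file stores that column as doubles (PlainDoubleDictionary).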
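The following is a minimal sketch of how such a folder can come about, not a reproduction of the original job. It assumes a local SparkSession; the temporary directory, the column name b, and the sample values are all hypothetical. Two appends write the same column with different types, and the subsequent read is wrapped in Try because the exact exception depends on the Spark version and reader settings.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import scala.util.Try

// In a notebook, getOrCreate() returns the existing session and the master setting is ignored.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical scratch folder used only for this demonstration.
val demoPath = Files.createTempDirectory("mixed-schema-demo").toString

// First writer stores column b as long, second writer stores it as double.
Seq(1L, 2L, 3L).toDF("b").write.mode("append").parquet(demoPath)
Seq(1.0, 2.0, 3.0).toDF("b").write.mode("append").parquet(demoPath)

// Without schema merging, Spark takes the schema from a subset of the files,
// so rows from the other file may fail to decode at read time.
val result = Try(spark.read.parquet(demoPath).collect())
println(result)
```

If the read fails, the Failure printed here should point at a type conversion problem similar to the one shown in the Problem section, although the exception class may differ.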

Solution

Find the offending Parquet files and rewrite them with the correct schema. To locate them, try reading the Parquet dataset with schema merging enabled:

spark.read.option("mergeSchema", "true").parquet(path)

or

spark.conf.set("spark.sql.parquet.mergeSchema", "true")
spark.read.parquet(path)

If the folder does contain Parquet files with incompatible schemas, the snippets above fail with an error that names the file with the wrong schema.
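If the error message is not enough to identify every offending file, one possible approach is to read each file's schema separately and group the files by schema. The sketch below assumes a flat (non-partitioned) folder whose data files end in .parquet. The helper name schemasByFile and the demo folder are hypothetical, and the demo data exists only so the example can run on its own.

```scala
import java.nio.file.Files
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

// Hypothetical helper: groups the Parquet files directly under `dir` by their schema.
// It does not descend into partition subdirectories.
def schemasByFile(spark: SparkSession, dir: String): Map[StructType, Seq[String]] = {
  val root = new Path(dir)
  val fs = root.getFileSystem(spark.sparkContext.hadoopConfiguration)
  val files = fs.listStatus(root).toSeq
    .map(_.getPath.toString)
    .filter(_.endsWith(".parquet"))
  // Reading a single file only inspects that file's footer, so no merge is attempted.
  files.groupBy(file => spark.read.parquet(file).schema)
}

// In a notebook, getOrCreate() returns the existing session and the master setting is ignored.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Demo folder (hypothetical): column b is written once as long and once as string.
val demoDir = Files.createTempDirectory("schema-groups-demo").toString
Seq(1L, 2L).toDF("b").write.mode("append").parquet(demoDir)
Seq("x", "y").toDF("b").write.mode("append").parquet(demoDir)

val groups = schemasByFile(spark, demoDir)
groups.foreach { case (schema, files) =>
  println(s"${files.size} file(s) with schema:\n${schema.treeString}")
  files.foreach(f => println(s"  $f"))
}
```

On a real dataset you would pass your own path instead of the demo folder. The group with the fewest files is usually the one to rewrite, but that is a heuristic rather than a rule.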

You can also check whether two schemas are compatible by using the merge method of StructType. For example, suppose you have these two schemas:

import org.apache.spark.sql.types._

val struct1 = (new StructType)
  .add("a", "int", true)
  .add("b", "long", false)

val struct2 = (new StructType)
  .add("a", "int", true)
  .add("b", "long", false)
  .add("c", "timestamp", true)

Then you can test if they are compatible:

struct1.merge(struct2).treeString

This will give you:

res0: String =
"root
 |-- a: integer (nullable = true)
 |-- b: long (nullable = false)
 |-- c: timestamp (nullable = true)
"

However, if struct2 has the following incompatible schema:

val struct2 = (new StructType)
  .add("a", "int", true)
  .add("b", "string", false)

Then the same merge call throws the following SparkException:

org.apache.spark.SparkException: Failed to merge fields 'b' and 'b'. Failed to merge incompatible data types LongType and StringType
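If merge is not callable in your environment (in open-source Spark it is declared private[sql]), a rough approximation can be built from the public StructType API. The sketch below is not equivalent to merge: it only compares top-level fields by name and flags any difference in data type, so it is stricter than merge for nested or otherwise mergeable types. The helper name conflictingFields is hypothetical.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: returns (name, leftType, rightType) for every top-level field
// that appears in both schemas with a different data type.
def conflictingFields(left: StructType, right: StructType): Seq[(String, DataType, DataType)] = {
  val rightByName = right.fields.map(f => f.name -> f.dataType).toMap
  left.fields.toSeq.flatMap { f =>
    rightByName.get(f.name)
      .filter(_ != f.dataType)
      .map(rightType => (f.name, f.dataType, rightType))
  }
}

// Same shapes as the incompatible example above.
val expected = new StructType()
  .add("a", "int", true)
  .add("b", "long", false)

val actual = new StructType()
  .add("a", "int", true)
  .add("b", "string", false)

val conflicts = conflictingFields(expected, actual)
conflicts.foreach { case (name, l, r) =>
  println(s"Field '$name' differs: $l vs $r")
}
```

An empty result means no top-level type conflicts were found. Fields that exist in only one schema are ignored, which matches the first example, where the extra column c merges cleanly.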