Spark Serialized Task is Too Large

If you see one of the following error messages,

Spark 2.0 and higher versions:

Serialized task XXX:XXX was XXX bytes, which exceeds max allowed: spark.rpc.message.maxSize (XXX bytes).
Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values.

Spark 1.6 and older versions:

Serialized task XXX:XXX was XXX bytes, which exceeds max allowed: spark.akka.frameSize (XXX bytes) - reserved (XXX bytes).
Consider increasing spark.akka.frameSize or using broadcast variables for large values.

you can try increasing the corresponding Spark conf (spark.rpc.message.maxSize for Spark 2.0 and higher versions, spark.akka.frameSize for Spark 1.6 and older versions) when starting the cluster to work around this error.
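
Both settings take a plain number of megabytes. Below is a minimal Scala sketch, assuming a standalone application where you create the SparkContext yourself and using an illustrative value of 512 MiB; on a managed cluster you would set the same key in the cluster's Spark configuration instead.

%scala

import org.apache.spark.{SparkConf, SparkContext}

// Raise the RPC message size limit at startup; the value is in MiB and
// 512 here is only an illustrative choice.
val conf = new SparkConf()
  .setAppName("large-task-app")
  .set("spark.rpc.message.maxSize", "512") // Spark 2.0 and higher versions
  // .set("spark.akka.frameSize", "512")   // Spark 1.6 and older versions

val sc = new SparkContext(conf)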

While tuning the configurations is one option, this error message typically means that you are sending large objects from the driver to the executors, e.g., calling parallelize with a large list, or converting a large R DataFrame to a Spark DataFrame.

If so, we recommend first auditing your code to remove the large objects you use, or leveraging broadcast variables instead (a broadcast sketch follows the examples below). If that does not resolve the error, you can increase the number of partitions to split the large list into multiple smaller pieces and reduce the Spark RPC message size. Here are examples for Scala and Python:

%scala

import spark.implicits._ // Needed for toDS(); `spark` is the active SparkSession

val largeList = Seq(...) // This is a large list
val partitionNum = 100   // Increase this number if necessary
val rdd = sc.parallelize(largeList, partitionNum)
val ds = rdd.toDS()

%python

largeList = [...] # This is a large list
partitionNum = 100 # Increase this number if necessary
rdd = sc.parallelize(largeList, partitionNum)
df = rdd.toDF() # PySpark has no toDS(); toDF() assumes the elements are Rows or tuples
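
If the large object is shared, read-only data that every task needs (rather than the rows you are parallelizing), a broadcast variable ships it to each executor once instead of embedding it in every serialized task. Here is a minimal Scala sketch, assuming a hypothetical lookupTable map:

%scala

// Broadcast the large read-only object once per executor instead of
// capturing it in every task closure.
val lookupTable: Map[String, Int] = Map(...) // hypothetical large lookup map
val broadcastLookup = sc.broadcast(lookupTable)

val result = sc.parallelize(1 to 1000000, 100)
  .map(key => broadcastLookup.value.getOrElse(key.toString, 0))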

For R users, you need to increase the partition number by setting the Spark conf spark.default.parallelism at cluster initialization; this configuration cannot be set after cluster creation. We are working on adding a SQL configuration to support changing it on the fly.