Package Cells

Package cells are special cells that get compiled when executed. These cells have no visibility with respect to the rest of the notebook; you may think of them as separate Scala files. This means that only class and object definitions may go inside a package cell. You may not have any variable or function definitions lying around by themselves.

If you wish to use custom classes and/or objects defined within notebooks reliably in Spark, and across notebook sessions, you must use package cells to define those classes. Unless you use package cells to define classes, you may also come across obscure bugs as follows:

// TestKey is defined directly in the notebook rather than in a package cell,
// and we use that class as a key in the groupBy.
case class TestKey(id: Long, str: String)

val rdd = sc.parallelize(Array(
  (TestKey(1L, "abd"), "dss"),
  (TestKey(2L, "ggs"), "dse"),
  (TestKey(1L, "abd"), "qrf")))
rdd.groupByKey().collect
rdd: org.apache.spark.rdd.RDD[(TestKey, String)] = ParallelCollectionRDD[11874] at parallelize at <console>:42
res10: Array[(TestKey, Iterable[String])] = Array((TestKey(2,ggs),CompactBuffer(dse)), (TestKey(1,abd),CompactBuffer(dss)), (TestKey(1,abd),CompactBuffer(qrf)))
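Notice that the two TestKey(1, "abd") records end up in separate groups even though the keys are identical. A minimal sketch (not part of the original notebook) that makes the symptom easier to spot; the expected counts are assumptions based on the output above:

// With a well-behaved key there are only two distinct keys, so both counts
// should be 2; with the notebook-defined TestKey they may come back as 3.
rdd.keys.distinct.count
rdd.groupByKey().count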
Defining the class inside a package cell fixes the problem:

package com.databricks.example

case class TestKey(id: Long, str: String)
Warning: classes defined within packages cannot be redefined without a cluster restart.
Compilation successful.
import com.databricks.example

val rdd = sc.parallelize(Array(
  (example.TestKey(1L, "abd"), "dss"),
  (example.TestKey(2L, "ggs"), "dse"),
  (example.TestKey(1L, "abd"), "qrf")))
rdd.groupByKey().collect
import com.databricks.example
rdd: org.apache.spark.rdd.RDD[(com.databricks.example.TestKey, String)] = ParallelCollectionRDD[11876] at parallelize at <console>:43
res12: Array[(com.databricks.example.TestKey, Iterable[String])] = Array((TestKey(2,ggs),CompactBuffer(dse)), (TestKey(1,abd),CompactBuffer(dss, qrf)))
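With the packaged TestKey, identical keys are now grouped together. The same consistency carries over to other key-based operations; a small illustrative sketch (not from the original notebook), with expected results noted as assumptions:

// Both TestKey(1, "abd") pairs should now collapse into a single entry.
rdd.keys.distinct.count                 // expected: 2
rdd.reduceByKey(_ + " " + _).collect    // expected: one combined value per distinct key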
Because only class and object definitions may go inside a package cell, the following cell will not work:

package x.y.z

val aNumber = 5 // won't work
def functionThatWillNotWork(a: Int): Int = a + 1
Compilation failed.
Wrapping the definitions in an object, however, works:

package x.y.z

object Utils {
  val aNumber = 5 // works!
  def functionThatWillWork(a: Int): Int = a + 1
}
Warning: classes defined within packages cannot be redefined without a cluster restart.
Compilation successful.
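Members of the packaged object can then be used from ordinary notebook cells. A minimal usage sketch (not part of the original notebook):

import x.y.z.Utils

Utils.aNumber                    // 5
Utils.functionThatWillWork(41)   // 42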
Classes and objects defined in a package cell can also use the SparkContext, either through a constructor parameter or as a method argument:

package x.y.zpackage

import org.apache.spark.SparkContext

case class IntArray(values: Array[Int])

class MyClass(sc: SparkContext) {
  def sparkSum(array: IntArray): Int = {
    sc.parallelize(array.values).reduce(_ + _)
  }
}

object MyClass {
  def sparkSum(sc: SparkContext, array: IntArray): Int = {
    sc.parallelize(array.values).reduce(_ + _)
  }
}
Warning: classes defined within packages cannot be redefined without a cluster restart.
Compilation successful.
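Both entry points defined above can be driven from a regular notebook cell using the notebook's SparkContext. A minimal usage sketch (not from the original notebook); the sample array is illustrative:

import x.y.zpackage.{IntArray, MyClass}

val numbers = IntArray(Array(1, 2, 3, 4))
new MyClass(sc).sparkSum(numbers)   // 10
MyClass.sparkSum(sc, numbers)       // 10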