Run C and C++ code (Scala)

Run C/C++ code on Databricks

This notebook shows how to compile C/C++ code and run it on a Spark cluster in Databricks.

Setup: Write/Copy C/C++ code to DBFS.

Write or copy your code to DBFS so that it can later be copied onto the local disk of the Spark driver and compiled there.

For this simple example, the program could have just been written directly to the local disk of the Spark Driver, but copying to DBFS first makes more sense if you have a large number of C/C++ files.
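If you do have many source files, the whole tree can be copied in one call. The sketch below assumes a hypothetical dbfs:/tmp/mysrc/ directory and relies on the recurse flag of dbutils.fs.cp.

// Sketch: copy an entire (hypothetical) source tree from DBFS to the driver's local disk.
dbutils.fs.cp("dbfs:/tmp/mysrc/", "file:/tmp/mysrc/", recurse = true)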

// This is a very simple test program
dbutils.fs.put("dbfs:/tmp/simple.c",
"""
#include <stdio.h>

int main (int argc, char *argv[]) {
  char str[100];

  while (1) {
    if (!fgets(str, 100, stdin)) {
      return 0;
    }
    printf("Hello, %s", str);
  }
}
""", true)
Wrote 182 bytes. res0: Boolean = true
// Verify the program was written out correctly.
dbutils.fs.head("dbfs:/tmp/simple.c")
res1: String = " #include <stdio.h> int main (int argc, char *argv[]) { char str[100]; while (1) { if (!fgets(str, 100, stdin)) { return 0; } printf("Hello, %s", str); } } "

Step 1: Compile the C/C++ code for the Spark machines.

// Copy the file to the local disk of the Spark driver, so it can be compiled.
dbutils.fs.cp("dbfs:/tmp/simple.c", "file:/tmp/simple.c")
res2: Boolean = true
// Delete any previously compiled binary, if one exists.
dbutils.fs.rm("file:/tmp/simple")
res3: Boolean = true
// Compile the C/C++ code to a binary.
import scala.sys.process._

val compileOutput = "/usr/bin/gcc -o /tmp/simple /tmp/simple.c" !!
warning: there were 1 feature warning(s); re-run with -feature for details import scala.sys.process._ compileOutput: String = ""
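Note that !! returns the process's stdout and throws an exception on a non-zero exit code, so a failed compile does not show gcc's error messages. If you need the compiler diagnostics, one option is scala.sys.process.ProcessLogger; a minimal sketch:

// Sketch: capture gcc's stderr so compile errors are visible in the notebook.
import scala.sys.process._

val gccErrors = new StringBuilder
val exitCode = Process(Seq("/usr/bin/gcc", "-o", "/tmp/simple", "/tmp/simple.c"))
  .!(ProcessLogger(_ => (), line => gccErrors.append(line).append("\n")))
if (exitCode != 0) println(s"gcc failed with exit code $exitCode:\n$gccErrors")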
// Check for the binary.
display(dbutils.fs.ls("file:/tmp/simple"))
path: file:/tmp/simple, name: simple, size: 8720
// Copy the binary to DBFS, so it will be accessible to all Spark worker nodes.
dbutils.fs.cp("file:/tmp/simple", "dbfs:/tmp/simple")
res5: Boolean = true

Step 2: Copy the binary to all the Spark worker nodes.

The next cell schedules more tasks than there are workers (2*numWorkerNodes), which makes it likely that at least one task, and therefore one copy of the binary, runs on every node.

val numWorkerNodes = 2
sc.parallelize(1 to 2*numWorkerNodes).map(s => dbutils.fs.cp("dbfs:/tmp/simple", "file:/tmp/simple")).count()
numWorkerNodes: Int = 2 res6: Long = 4
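Because the copy runs as ordinary tasks, it is worth confirming that the binary actually landed on every worker. A quick sanity check (a sketch, reusing numWorkerNodes from above):

// Sketch: report, per task, the host name and whether /tmp/simple is present and executable.
sc.parallelize(1 to 2 * numWorkerNodes).map { _ =>
  val f = new java.io.File("/tmp/simple")
  (java.net.InetAddress.getLocalHost.getHostName, f.exists, f.canExecute)
}.collect().distinct.foreach(println)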

Step 3: Use the pipe transformation to call your program on an input dataset.

val names = sc.parallelize(Seq("Don", "Betty", "Sally"))
names: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[206] at parallelize at <console>:37
val piped = names.pipe(Seq("/tmp/simple"))
piped: org.apache.spark.rdd.RDD[String] = PipedRDD[207] at pipe at <console>:39
piped.collect().map(println(_))
Hello, Don Hello, Betty Hello, Sally res7: Array[Unit] = Array((), (), ())
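pipe also accepts a map of environment variables for the child process, which can be handy for things like forcing a locale. A sketch (the locale name is an assumption and must exist on the workers):

// Sketch: pass environment variables to the piped program.
val pipedUtf8 = names.pipe(Seq("/tmp/simple"), Map("LC_ALL" -> "en_US.UTF-8"))
pipedUtf8.collect().foreach(println)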

Miscellaneous Tip

If the strings above come back corrupted, check how your C/C++ code handles character encodings.

// The JVMs on Databricks clusters are set to expect UTF-8 encoding.
import java.nio.charset.Charset
Charset.defaultCharset()
import java.nio.charset.Charset res8: java.nio.charset.Charset = US-ASCII
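The cell above only reports the driver's default charset; the workers can be configured differently. If you suspect an encoding mismatch, a sketch for checking each worker:

// Sketch: collect the default charset seen by tasks on the workers.
sc.parallelize(1 to 2 * numWorkerNodes)
  .map(_ => java.nio.charset.Charset.defaultCharset().toString)
  .collect()
  .distinct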

Cleanup: Delete the temporary files from DBFS.

dbutils.fs.rm("dbfs:/tmp/simple.c")
res9: Boolean = true
dbutils.fs.rm("dbfs:/tmp/simple")
res10: Boolean = true
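The two calls above only remove the DBFS copies. If you also want to remove the local copies written to /tmp on the driver and the workers, a sketch (reusing the trick from Step 2; the worker-side delete is best-effort):

// Sketch: clean up the local copies as well.
dbutils.fs.rm("file:/tmp/simple.c")
dbutils.fs.rm("file:/tmp/simple")
sc.parallelize(1 to 2 * numWorkerNodes).map(_ => new java.io.File("/tmp/simple").delete()).count()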