Run C and C++ code (Scala)

Run C/C++ code on Databricks

This notebook shows how to compile C/C++ code and run it on a Spark cluster in Databricks.

Setup: Write/Copy C/C++ code to DBFS.

Write or copy your code to DBFS so that it can later be copied onto the local disk of the Spark driver and compiled there.

For this simple example, the program could have just been written directly to the local disk of the Spark Driver, but copying to DBFS first makes more sense if you have a large number of C/C++ files.
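If you do have many source files, the whole tree can be copied in one call. The sketch below assumes a hypothetical dbfs:/tmp/mysrc/ directory and relies on the recurse flag of dbutils.fs.cp.

// Sketch: copy an entire (hypothetical) source tree from DBFS to the driver's local disk.
dbutils.fs.cp("dbfs:/tmp/mysrc/", "file:/tmp/mysrc/", recurse = true)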

// This is a very simple test program
dbutils.fs.put("dbfs:/tmp/simple.c",
"""
#include <stdio.h>

int main (int argc, char *argv[]) {
  char str[100];

  while (1) {
    if (!fgets(str, 100, stdin)) {
      return 0;
    }
    printf("Hello, %s", str);
  }
}
""", true)
Wrote 182 bytes. res0: Boolean = true
// Verify the program was written out correctly.
dbutils.fs.head("dbfs:/tmp/simple.c")
res1: String = " #include <stdio.h> int main (int argc, char *argv[]) { char str[100]; while (1) { if (!fgets(str, 100, stdin)) { return 0; } printf("Hello, %s", str); } } "

Step 1: Compile the C/C++ code for the Spark machines.

// Copy the file to the local disk of the Spark driver, so it can be compiled.
dbutils.fs.cp("dbfs:/tmp/simple.c", "file:/tmp/simple.c")
res2: Boolean = true
// Delete any previously compiled binary, if one exists.
dbutils.fs.rm("file:/tmp/simple")
res3: Boolean = true
// Compile the C/C++ code to a binary.
import scala.sys.process._

val compileOutput = "/usr/bin/gcc -o /tmp/simple /tmp/simple.c" !!
warning: there were 1 feature warning(s); re-run with -feature for details import scala.sys.process._ compileOutput: String = ""
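Note that !! returns the process's stdout and throws an exception on a non-zero exit code, so a failed compile does not show gcc's error messages. If you need the compiler diagnostics, one option is scala.sys.process.ProcessLogger; a minimal sketch:

// Sketch: capture gcc's stderr so compile errors are visible in the notebook.
import scala.sys.process._

val gccErrors = new StringBuilder
val exitCode = Process(Seq("/usr/bin/gcc", "-o", "/tmp/simple", "/tmp/simple.c"))
  .!(ProcessLogger(_ => (), line => gccErrors.append(line).append("\n")))
if (exitCode != 0) println(s"gcc failed with exit code $exitCode:\n$gccErrors")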
// Check for the binary.
display(dbutils.fs.ls("file:/tmp/simple"))
path: file:/tmp/simple, name: simple, size: 8720
// Copy the binary to DBFS, so it will be accessible to all Spark worker nodes.
dbutils.fs.cp("file:/tmp/simple", "dbfs:/tmp/simple")
res5: Boolean = true

Step 2: Copy the binary to all the Spark worker nodes.

The next cell schedules more tasks than there are workers (2*numWorkerNodes), which makes it likely that at least one task, and therefore one copy of the binary, runs on every node.

val numWorkerNodes = 2
sc.parallelize(1 to 2*numWorkerNodes).map(s => dbutils.fs.cp("dbfs:/tmp/simple", "file:/tmp/simple")).count()
numWorkerNodes: Int = 2 res6: Long = 4
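Because the copy runs as ordinary tasks, it is worth confirming that the binary actually landed on every worker. A quick sanity check (a sketch, reusing numWorkerNodes from above):

// Sketch: report, per task, the host name and whether /tmp/simple is present and executable.
sc.parallelize(1 to 2 * numWorkerNodes).map { _ =>
  val f = new java.io.File("/tmp/simple")
  (java.net.InetAddress.getLocalHost.getHostName, f.exists, f.canExecute)
}.collect().distinct.foreach(println)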

Step 3: Use the pipe transformation to call your program on an input dataset.

val names = sc.parallelize(Seq("Don", "Betty", "Sally"))
names: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[206] at parallelize at <console>:37
val piped = names.pipe(Seq("/tmp/simple"))
piped: org.apache.spark.rdd.RDD[String] = PipedRDD[207] at pipe at <console>:39
piped.collect().map(println(_))
Hello, Don Hello, Betty Hello, Sally res7: Array[Unit] = Array((), (), ())
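pipe also accepts a map of environment variables for the child process, which can be handy for things like forcing a locale. A sketch (the locale name is an assumption and must exist on the workers):

// Sketch: pass environment variables to the piped program.
val pipedUtf8 = names.pipe(Seq("/tmp/simple"), Map("LC_ALL" -> "en_US.UTF-8"))
pipedUtf8.collect().foreach(println)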

Miscellaneous Tip

If the strings above come back corrupted, check how your C/C++ code handles character encodings.

// The JVMs on Databricks clusters are set to expect UTF-8 encoding.
import java.nio.charset.Charset
Charset.defaultCharset()
import java.nio.charset.Charset res8: java.nio.charset.Charset = US-ASCII
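The cell above only reports the driver's default charset; the workers can be configured differently. If you suspect an encoding mismatch, a sketch for checking each worker:

// Sketch: collect the default charset seen by tasks on the workers.
sc.parallelize(1 to 2 * numWorkerNodes)
  .map(_ => java.nio.charset.Charset.defaultCharset().toString)
  .collect()
  .distinct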

Cleanup: Delete the temporary files from DBFS.

dbutils.fs.rm("dbfs:/tmp/simple.c")
res9: Boolean = true
dbutils.fs.rm("dbfs:/tmp/simple")
res10: Boolean = true
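The two calls above only remove the DBFS copies. If you also want to remove the local copies written to /tmp on the driver and the workers, a sketch (reusing the trick from Step 2; the worker-side delete is best-effort):

// Sketch: clean up the local copies as well.
dbutils.fs.rm("file:/tmp/simple.c")
dbutils.fs.rm("file:/tmp/simple")
sc.parallelize(1 to 2 * numWorkerNodes).map(_ => new java.io.File("/tmp/simple").delete()).count()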