Run C and C++ code (Python)

Run C/C++ code on Databricks

This notebook shows how to compile C/C++ code and run it on a Spark cluster in Databricks.

Setup: Write/Copy C/C++ code to DBFS.

Write or copy your code to DBFS so that it can later be copied onto the Spark driver and compiled there.

For this simple example, the program could have been written directly to the local disk of the Spark driver, but copying to DBFS first makes more sense if you have a large number of C/C++ files.
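If you do have many source files, a single recursive copy moves the whole tree between DBFS and the driver at once. A minimal sketch, assuming a hypothetical dbfs:/tmp/csrc directory that holds your .c files:

# Hypothetical: copy an entire directory of C sources from DBFS to the driver's local disk.
dbutils.fs.cp("dbfs:/tmp/csrc", "file:/tmp/csrc", recurse=True)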

# This is a very simple test program
dbutils.fs.put("dbfs:/tmp/simple.c",
"""
#include <stdio.h>

int main (int argc, char *argv[]) {
  char str[100];

  while (1) {
    if (!fgets(str, 100, stdin)) {
      return 0;
    }
    printf("Hello, %s", str);
  }
}
""", True)
Wrote 182 bytes.
Out[1]: True
# Verify the program was written correctly.
print(dbutils.fs.head("dbfs:/tmp/simple.c"))
#include <stdio.h>

int main (int argc, char *argv[]) {
  char str[100];

  while (1) {
    if (!fgets(str, 100, stdin)) {
      return 0;
    }
    printf("Hello, %s", str);
  }
}

Step 1: Compile the C/C++ code for the Spark machines.

# Copy the file to the local disk of the Spark driver, so it can be compiled.
dbutils.fs.cp("dbfs:/tmp/simple.c", "file:/tmp/simple.c")
Out[3]: True
# Delete the binary if it already exists.
dbutils.fs.rm("file:/tmp/simple")
Out[4]: False
# Compile the C/C++ code to a binary.
import os
os.system("/usr/bin/gcc -o /tmp/simple /tmp/simple.c")
Out[5]: 0
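os.system only surfaces the exit code (0 above means success). If compilation fails, a sketch like the following, using Python's standard subprocess module, captures the compiler's error output so it is visible in the notebook:

import subprocess

# Run gcc and capture stderr so that any compiler errors are shown.
proc = subprocess.Popen(["/usr/bin/gcc", "-o", "/tmp/simple", "/tmp/simple.c"],
                        stderr=subprocess.PIPE)
_, err = proc.communicate()
if proc.returncode != 0:
  print(err)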
# Check for the binary.
display(dbutils.fs.ls("file:/tmp/simple"))
path                name     size
file:/tmp/simple    simple   8720
# Copy the binary to DBFS, so it will be accessible to all Spark worker nodes.
dbutils.fs.cp("file:/tmp/simple", "dbfs:/tmp/simple")
Out[7]: True

Step 2: Copy the binary to all the Spark worker nodes.

Alternatively, you could use init scripts to do this (a sketch follows after this step's code), but you would have to call the DBFS library directly.

import os
import shutil

num_worker_nodes = 1

def copyFile(filepath):
  # Copy from the DBFS FUSE mount to the node's local disk, then make it executable.
  shutil.copyfile("/dbfs%s" % filepath, filepath)
  os.system("chmod u+x %s" % filepath)

# Schedule more tasks than there are nodes so that every node is likely to run the copy at least once.
sc.parallelize(range(0, 2 * (1 + num_worker_nodes))).map(lambda s: copyFile("/tmp/simple")).count()
Out[8]: 4
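As mentioned above, an init script can perform the same copy when each node starts. A minimal sketch follows; the script path dbfs:/databricks/init-scripts/copy-simple.sh is an assumption, the script must be attached in the cluster configuration, and it assumes the /dbfs FUSE mount is available when it runs (otherwise you would need to call the DBFS library directly, as noted above).

# Hypothetical init script location; attach it to the cluster so it runs on each node at startup.
dbutils.fs.put("dbfs:/databricks/init-scripts/copy-simple.sh", """
#!/bin/bash
# Copy the binary from the DBFS FUSE mount and make it executable.
cp /dbfs/tmp/simple /tmp/simple
chmod u+x /tmp/simple
""", True)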

Step 3: Use the pipe transformation to call your program on an input dataset.

names = sc.parallelize(["Don", "Betty", "Sally"])
piped = names.pipe("/tmp/simple")
print "test"
test
piped.collect()
Out[12]: [u'Hello, Don', u'Hello, Betty', u'Hello, Sally']
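By default, pipe ignores the exit status of the external program. On Spark 2.0 and later, RDD.pipe accepts a checkCode argument that fails the job if the binary exits with a nonzero status, which makes silent failures easier to catch:

# Fail the Spark job if /tmp/simple exits with a nonzero status (Spark 2.0+).
piped = names.pipe("/tmp/simple", checkCode=True)
piped.collect()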

Miscellaneous Tip

If the strings above appear corrupted, check how your C/C++ code handles encodings and how Python handles them.
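For example, non-ASCII input can be encoded explicitly before it reaches the C program. A minimal sketch (the choice of UTF-8 is an assumption about your data):

# Ensure names reach the C program as UTF-8 bytes rather than platform-default strings.
utf8_names = names.map(lambda s: s.encode("utf-8"))
piped = utf8_names.pipe("/tmp/simple")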

Cleanup: Delete the temporary files from DBFS.

dbutils.fs.rm("dbfs:/tmp/simple.c")
Out[13]: True
dbutils.fs.rm("dbfs:/tmp/simple")
Out[14]: True
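The copies on the driver's local disk are not removed by the commands above; if you want to tidy those up as well, they can be deleted the same way (the copies in each worker's /tmp persist until the cluster restarts):

dbutils.fs.rm("file:/tmp/simple")
dbutils.fs.rm("file:/tmp/simple.c")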