Run C and C++ code (Python)

Run C/C++ code on Databricks

This notebook shows how to compile C/C++ code and run it on a Spark cluster in Databricks.

Setup: Write/Copy C/C++ code to DBFS.

Write or copy your code to DBFS so that it can later be copied onto the Spark driver and compiled there.

For this simple example, the program could have been written directly to the local disk of the Spark driver, but copying to DBFS first makes more sense if you have a large number of C/C++ files.
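If you do have many source files, a single recursive copy moves the whole tree between DBFS and the driver at once. A minimal sketch, assuming a hypothetical dbfs:/tmp/csrc directory that holds your .c files:

# Hypothetical: copy an entire directory of C sources from DBFS to the driver's local disk.
dbutils.fs.cp("dbfs:/tmp/csrc", "file:/tmp/csrc", recurse=True)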

# This is a very simple test program
dbutils.fs.put("dbfs:/tmp/simple.c",
"""
#include <stdio.h>

int main (int argc, char *argv[]) {
  char str[100];

  while (1) {
    if (!fgets(str, 100, stdin)) {
      return 0;
    }
    printf("Hello, %s", str);
  }
}
""", True)
Wrote 182 bytes.
Out[1]: True
# Verify the program was written correctly.
print(dbutils.fs.head("dbfs:/tmp/simple.c"))
#include <stdio.h>

int main (int argc, char *argv[]) {
  char str[100];

  while (1) {
    if (!fgets(str, 100, stdin)) {
      return 0;
    }
    printf("Hello, %s", str);
  }
}

Step 1: Compile the C/C++ code for the Spark machines.

# Copy the file to the local disk of the Spark driver, so it can be compiled.
dbutils.fs.cp("dbfs:/tmp/simple.c", "file:/tmp/simple.c")
Out[3]: True
# Delete the binary if it already exists.
dbutils.fs.rm("file:/tmp/simple")
Out[4]: False
# Compile the C/C++ code to a binary.
import os
os.system("/usr/bin/gcc -o /tmp/simple /tmp/simple.c")
Out[5]: 0
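os.system only surfaces the exit code (0 above means success). If compilation fails, a sketch like the following, using Python's standard subprocess module, captures the compiler's error output so it is visible in the notebook:

import subprocess

# Run gcc and capture stderr so that any compiler errors are shown.
proc = subprocess.Popen(["/usr/bin/gcc", "-o", "/tmp/simple", "/tmp/simple.c"],
                        stderr=subprocess.PIPE)
_, err = proc.communicate()
if proc.returncode != 0:
  print(err)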
# Check for the binary.
display(dbutils.fs.ls("file:/tmp/simple"))
path                name     size
file:/tmp/simple    simple   8720
# Copy the binary to DBFS, so it will be accessible to all Spark worker nodes.
dbutils.fs.cp("file:/tmp/simple", "dbfs:/tmp/simple")
Out[7]: True

Step 2: Copy the binary to all the Spark worker nodes.

Alternatively, you could use init scripts to do this (a sketch follows after this step's code), but you would have to call the DBFS library directly.

import os
import shutil

num_worker_nodes = 1

def copyFile(filepath):
  # Copy from the DBFS FUSE mount to the node's local disk, then make it executable.
  shutil.copyfile("/dbfs%s" % filepath, filepath)
  os.system("chmod u+x %s" % filepath)

# Schedule more tasks than there are nodes so that every node is likely to run the copy at least once.
sc.parallelize(range(0, 2 * (1 + num_worker_nodes))).map(lambda s: copyFile("/tmp/simple")).count()
Out[8]: 4
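As mentioned above, an init script can perform the same copy when each node starts. A minimal sketch follows; the script path dbfs:/databricks/init-scripts/copy-simple.sh is an assumption, the script must be attached in the cluster configuration, and it assumes the /dbfs FUSE mount is available when it runs (otherwise you would need to call the DBFS library directly, as noted above).

# Hypothetical init script location; attach it to the cluster so it runs on each node at startup.
dbutils.fs.put("dbfs:/databricks/init-scripts/copy-simple.sh", """
#!/bin/bash
# Copy the binary from the DBFS FUSE mount and make it executable.
cp /dbfs/tmp/simple /tmp/simple
chmod u+x /tmp/simple
""", True)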

Step 3: Use the pipe transformation to call your program on an input dataset.

names = sc.parallelize(["Don", "Betty", "Sally"])
piped = names.pipe("/tmp/simple")
print "test"
test
piped.collect()
Out[12]: [u'Hello, Don', u'Hello, Betty', u'Hello, Sally']
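By default, pipe ignores the exit status of the external program. On Spark 2.0 and later, RDD.pipe accepts a checkCode argument that fails the job if the binary exits with a nonzero status, which makes silent failures easier to catch:

# Fail the Spark job if /tmp/simple exits with a nonzero status (Spark 2.0+).
piped = names.pipe("/tmp/simple", checkCode=True)
piped.collect()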

Miscellaneous Tip

If the strings above appear corrupted, check how your C/C++ code handles encodings and how Python handles them.
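For example, non-ASCII input can be encoded explicitly before it reaches the C program. A minimal sketch (the choice of UTF-8 is an assumption about your data):

# Ensure names reach the C program as UTF-8 bytes rather than platform-default strings.
utf8_names = names.map(lambda s: s.encode("utf-8"))
piped = utf8_names.pipe("/tmp/simple")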

Cleanup: Delete the temporary files from DBFS.

dbutils.fs.rm("dbfs:/tmp/simple.c")
Out[13]: True
dbutils.fs.rm("dbfs:/tmp/simple")
Out[14]: True
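The copies on the driver's local disk are not removed by the commands above; if you want to tidy those up as well, they can be deleted the same way (the copies in each worker's /tmp persist until the cluster restarts):

dbutils.fs.rm("file:/tmp/simple")
dbutils.fs.rm("file:/tmp/simple.c")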