How to Set up Apache Kafka on Databricks

This topic explains how to set up Apache Kafka on AWS EC2 instances and connect them to Databricks. The following are the high-level steps required to create a Kafka cluster and connect to it from Databricks notebooks.

Step 1: Create a new VPC in AWS

  1. When creating the new VPC, set its CIDR range so that it does not overlap with the Databricks VPC CIDR range. For example:

    • Databricks VPC vpc-7f4c0d18 has CIDR IP range 10.205.0.0/16

      ../../_images/databricks-vpc.png
    • New VPC vpc-8eb1faf7 has CIDR IP range 10.10.0.0/16

      ../../_images/new-vpc.png
  2. Create a new internet gateway and attach it to the route table of the new VPC so that you can SSH into the EC2 instances you launch in it; a CLI sketch of both steps follows this list.

    1. Create a new internet gateway.

      ../../_images/new-igw.png
    2. Attach it to VPC vpc-8eb1faf7.

      ../../_images/attach-igw.png
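
If you script your infrastructure, a minimal AWS CLI sketch of the same two steps looks like this; the gateway and route table IDs are placeholders for illustration, not values from this walkthrough.

    # Create the new VPC with a CIDR range that does not overlap the Databricks VPC
    aws ec2 create-vpc --cidr-block 10.10.0.0/16

    # Create an internet gateway and attach it to the new VPC
    aws ec2 create-internet-gateway
    aws ec2 attach-internet-gateway --internet-gateway-id igw-0example --vpc-id vpc-8eb1faf7

    # Route internet-bound traffic from the VPC's route table through the gateway
    aws ec2 create-route --route-table-id rtb-0example --destination-cidr-block 0.0.0.0/0 --gateway-id igw-0example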

Step 2: Launch the EC2 instance in the new VPC

Launch the EC2 instance inside the new VPC vpc-8eb1faf7 created in Step 1.

../../_images/new-ec2.png
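
Equivalently, a hedged AWS CLI sketch; the AMI, instance type, key pair, subnet, and security group values are placeholders you would substitute with your own.

    # Launch one instance in a subnet of the new VPC (all IDs illustrative)
    aws ec2 run-instances \
        --image-id ami-0example \
        --instance-type m4.large \
        --key-name keypair \
        --subnet-id subnet-0example \
        --security-group-ids sg-0example \
        --associate-public-ip-address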

Step 3: Install Kafka and ZooKeeper on the new EC2 instance

  1. SSH into the machine with the key pair.

    ssh -i keypair.pem ec2-user@ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com
    
  2. Download Kafka and extract the archive.

    wget http://apache.claz.org/kafka/0.10.2.1/kafka_2.12-0.10.2.1.tgz
    tar -zxf kafka_2.12-0.10.2.1.tgz
    
  3. Start the ZooKeeper process.

    cd kafka_2.12-0.10.2.1
    bin/zookeeper-server-start.sh config/zookeeper.properties
    
  4. Edit the config/server.properties file and set advertised.listeners to the private IP of the EC2 node (10.10.143.166 in this example); a fuller sketch of the relevant settings follows this list.

    advertised.listeners=PLAINTEXT://10.10.143.166:9092
    
  5. Start the Kafka broker. Because zookeeper-server-start.sh from step 3 runs in the foreground, do this from a second SSH session (or run ZooKeeper in the background).

    cd kafka_2.12-0.10.2.1
    bin/kafka-server-start.sh config/server.properties
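
For reference, here is a minimal sketch of the config/server.properties settings that matter for this setup; broker.id is the stock default and the IP is the example private IP from step 4.

    # Unique ID of this broker (stock default)
    broker.id=0
    # Address the broker advertises to clients; must be reachable from Databricks
    advertised.listeners=PLAINTEXT://10.10.143.166:9092
    # ZooKeeper runs on the same host in this walkthrough
    zookeeper.connect=localhost:2181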

Step 4: Peer the two VPCs

  1. Create a new peering connection.

    ../../_images/create-peering.png
  2. Add routes for the peering connection to the route tables of your Databricks VPC and of the Kafka VPC created in Step 1 (a CLI sketch follows this list).

    • In the Kafka VPC, go to the route table and add the route to the Databricks VPC.

      ../../_images/new-vpc-route-table.png
    • In the Databricks VPC, go to the route table and add the route to the Kafka VPC.

      ../../_images/databricks-vpc-route-table.png
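
Scripted with the AWS CLI, the same flow looks roughly like this; the peering connection and route table IDs are placeholders.

    # Request and accept a peering connection between the Kafka and Databricks VPCs
    aws ec2 create-vpc-peering-connection --vpc-id vpc-8eb1faf7 --peer-vpc-id vpc-7f4c0d18
    aws ec2 accept-vpc-peering-connection --vpc-peering-connection-id pcx-0example

    # Kafka VPC route table: send Databricks-bound traffic over the peering connection
    aws ec2 create-route --route-table-id rtb-kafka0example --destination-cidr-block 10.205.0.0/16 --vpc-peering-connection-id pcx-0example

    # Databricks VPC route table: send Kafka-bound traffic over the peering connection
    aws ec2 create-route --route-table-id rtb-databricks0ex --destination-cidr-block 10.10.0.0/16 --vpc-peering-connection-id pcx-0example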

For details on VPC peering, refer to VPC Peering.

Step 5: Access the Kafka broker from a notebook

  1. Verify that you can reach the EC2 instance running the Kafka broker on port 9092, for example with telnet 10.10.143.166 9092.

    ../../_images/telnet-output.png
  2. Create a topic in the Kafka broker and publish data to it.

    1. SSH to the Kafka broker.

      ssh -i keypair.pem ec2-user@ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com
      
    2. From the Kafka directory, write the contents of the LICENSE file to a wordcount topic with the console producer. Because Kafka auto-creates topics by default (auto.create.topics.enable=true), this also creates the topic.

      cd kafka_2.12-0.10.2.1
      bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wordcount < LICENSE
      
  3. Read the data in a notebook. The raw records are returned as bytes; a short sketch after this step casts them to strings.

    import org.apache.spark.sql.functions._

    val kafka = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "10.10.143.166:9092")
      .option("subscribe", "wordcount")
      .option("startingOffsets", "earliest")
      .load()

    display(kafka)
    
    ../../_images/example-data.png

    Example Kafka byte stream
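
    Kafka delivers keys and values as raw bytes, which is why the display above shows binary data. A minimal follow-on sketch, reusing the kafka DataFrame from the previous step, casts both columns to strings:

    // Cast the binary key and value columns to human-readable strings
    val messages = kafka.select(
      col("key").cast("string").as("key"),
      col("value").cast("string").as("value"))

    display(messages)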