Neo4j is a native graph database that leverages data relationships as first-class entities. You can connect a Databricks cluster to a Neo4j cluster using the neo4j-spark-connector, which offers Apache Spark APIs for RDD, DataFrame, GraphX, and GraphFrames. The neo4j-spark-connector uses the binary Bolt protocol to transfer data to and from the Neo4j server.
This article describes how to deploy and configure Neo4j and how to configure Databricks to access Neo4j. It also includes a notebook demonstrating usage.
You cannot access this data source from a cluster running Databricks Runtime 7.0 or above because a Neo4j connector that supports Apache Spark 3.0 is not available.
You can deploy Neo4j on various cloud providers.
Change the Neo4j password from the default (you should be prompted when you first access Neo4j) and modify conf/neo4j.conf to accept remote connections.
# conf/neo4j.conf

# Bolt connector
dbms.connector.bolt.enabled=true
#dbms.connector.bolt.tls_level=OPTIONAL
dbms.connector.bolt.listen_address=0.0.0.0:7687

# HTTP Connector. There must be exactly one HTTP connector.
dbms.connector.http.enabled=true
#dbms.connector.http.listen_address=0.0.0.0:7474

# HTTPS Connector. There can be zero or one HTTPS connectors.
dbms.connector.https.enabled=true
#dbms.connector.https.listen_address=0.0.0.0:7473
For more information, see Configuring Neo4j Connectors.
If your Neo4j cluster is running in AWS and you want to use private IPs, see the VPC Peering guide.
Create a cluster with these Spark configurations.
spark.neo4j.bolt.url bolt://<ip-of-neo4j-instance>:7687
spark.neo4j.bolt.user <username>
spark.neo4j.bolt.password <password>
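One way to confirm that the cluster picked up these settings is to read them back from a notebook attached to it. The following is a minimal sketch, assuming `sc` is the SparkContext that Databricks notebooks predefine. It deliberately skips the password key so the secret is not printed.

```scala
// Assumes `sc` is the SparkContext predefined in a Databricks notebook.
val conf = sc.getConf

// Print the Bolt URL and user name; the password is intentionally omitted.
Seq("spark.neo4j.bolt.url", "spark.neo4j.bolt.user").foreach { key =>
  val value = conf.getOption(key).getOrElse("<not set>")
  println(s"$key = $value")
}
```

If either key prints `<not set>`, the configuration was probably not saved on the cluster, or the cluster was not restarted after the change.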
Import libraries and test the connection.
import org.neo4j.spark._
import org.graphframes._

val neo = Neo4j(sc)

// Dummy Cypher query to check connection
val testConnection = neo.cypher("MATCH (n) RETURN n;").loadRdd[Long]
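Spark RDDs are evaluated lazily, so defining an RDD does not by itself send a query to Neo4j; the connection is exercised only when an action runs. The following is a minimal sketch of a check that does contact the server, assuming the cluster configuration above is in place and `sc` is the notebook's SparkContext. The query returns a single numeric value so that it matches the `Long` type parameter; the name `nodeCount` is illustrative.

```scala
import org.neo4j.spark._

// Assumes `sc` is the SparkContext predefined in a Databricks notebook
// and that the spark.neo4j.bolt.* settings are configured on the cluster.
val neo = Neo4j(sc)

// count(n) yields one integer row, which maps cleanly to Long.
// Calling first() is an action, so the query is sent to Neo4j here.
val nodeCount = neo.cypher("MATCH (n) RETURN count(n)").loadRdd[Long].first()

println(s"Neo4j reports $nodeCount nodes")
```

If the Bolt URL or credentials are wrong, the error should surface at the `first()` call rather than when the RDD is defined.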