MongoDB Atlas via Spark
This notebook provides a high-level introduction to using Spark with MongoDB, enabling developers and data engineers to bring sophisticated, real-time analytics and machine learning to live, operational data.
The following illustrates how to use MongoDB and Spark together, with an example application that leverages MongoDB's aggregation pipeline to pre-process data inside MongoDB so it is ready for use in Databricks. It also shows how to query MongoDB and write results back for use in applications. This notebook covers:
- How to read data from MongoDB into Spark.
- How to run the MongoDB Connector for Spark as a library in Databricks.
- How to leverage MongoDB's aggregation pipeline from within Spark.
- How to use the machine learning ALS library in Spark to generate a set of personalized movie recommendations for a given user.
- How to write the results back to MongoDB so they are accessible to applications (a brief preview of this flow is sketched below).
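As a preview of the flow above, here is a minimal, illustrative sketch. The database, collection, and field names are placeholders for illustration only, and the aggregation-pipeline option key can differ between connector versions; the full walkthrough follows in the sections below.

```python
from pyspark.ml.recommendation import ALS

# Illustrative sketch only (assumes a Databricks notebook where `spark` is defined).
# Database, collection, and field names are placeholders, not a real schema.
# The "aggregation.pipeline" option key may differ between connector versions.
pipeline = '[{"$match": {"rating": {"$gte": 1}}}]'

ratings = (
    spark.read.format("mongodb")
    .option("database", "my_db")                # placeholder
    .option("collection", "ratings")            # placeholder
    .option("aggregation.pipeline", pipeline)   # pipeline is executed inside MongoDB
    .load()
    .select("user_id", "movie_id", "rating")
)

# Train a collaborative-filtering model and generate top-10 recommendations per user.
als = ALS(userCol="user_id", itemCol="movie_id", ratingCol="rating", coldStartStrategy="drop")
recommendations = als.fit(ratings).recommendForAllUsers(10)

# Write the recommendations back to MongoDB so applications can read them.
(
    recommendations.write.format("mongodb")
    .option("database", "my_db")                # placeholder
    .option("collection", "recommendations")    # placeholder
    .mode("overwrite")
    .save()
)
```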
Create a Databricks Cluster and Add the Connector as a Library
- Create a Databricks cluster.
- Navigate to the cluster detail page and select the Libraries tab.
- Click the Install New button.
- Select Maven as the Library Source.
- Use the Search Packages feature to find 'mongo-spark'. It should resolve to org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 or newer.
- Click Install.
For more information on the MongoDB Spark connector (which now supports Structured Streaming), see the MongoDB documentation.
Create a MongoDB Atlas Instance
Atlas is a fully managed, cloud-based MongoDB service. We'll use Atlas to test the integration between MongoDB and Spark.
- Sign up for MongoDB Atlas.
- Create an Atlas free tier cluster.
- Enable Databricks clusters to connect to the cluster by adding the external IP addresses of the Databricks cluster nodes to the IP access list in Atlas. For convenience you could (temporarily!) allow access from anywhere, though we recommend enabling network peering for production.
Prep MongoDB with a Sample Dataset
MongoDB Atlas provides a convenient sample dataset that lets you get started quickly. We will use it throughout this notebook.
- In MongoDB Atlas, load the sample dataset once the cluster is up and running.
- You can confirm the presence of the dataset via the Browse Collections button in the Atlas UI, or from Databricks as sketched below.
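Once the Spark connection is configured (see the next section), a quick way to confirm the data is visible from Databricks is to read one of the sample collections, for example sample_mflix.movies. This is a minimal sketch and assumes the cluster-level connection URI described below has already been set.

```python
# Minimal sketch, assuming the cluster-level spark.mongodb.input.uri is already configured
# (see the next section). sample_mflix.movies is part of the Atlas sample dataset.
movies = (
    spark.read.format("mongodb")
    .option("database", "sample_mflix")
    .option("collection", "movies")
    .load()
)

movies.printSchema()
print(movies.count())
```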
Update Spark Configuration with the Atlas Connection String
- Note the connection string shown in the Connect dialog in MongoDB Atlas. It has the form "mongodb+srv://<username>:<password>@<clustername>.xxxxx.mongodb.net/".
- Back in Databricks, in your cluster configuration under Advanced Options (at the bottom of the page), paste the connection string into both the spark.mongodb.output.uri and spark.mongodb.input.uri variables, populating the username and password fields appropriately. This way, all notebooks running on the cluster will use this configuration.
- Alternatively, you can set the option explicitly when calling APIs, for example: spark.read.format("mongodb").option("spark.mongodb.input.uri", connectionString).load(). If you configured the variables on the cluster, you don't have to set the option; see the sketch below for both approaches.
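Below is a minimal sketch of both approaches (explicit option versus cluster-level configuration). The connection string and collection names are placeholders, and the exact option keys can vary between connector versions, so check the connector documentation for your installed version.

```python
# Minimal sketch, assuming a Databricks notebook where `spark` is already defined.
# Replace the placeholder connection string with your own Atlas connection string.
connectionString = "mongodb+srv://<username>:<password>@<clustername>.xxxxx.mongodb.net/"

# Read: pass the connection string explicitly.
# (If spark.mongodb.input.uri is set on the cluster, the URI .option(...) line can be omitted.)
df = (
    spark.read.format("mongodb")
    .option("spark.mongodb.input.uri", connectionString)
    .option("database", "sample_mflix")     # placeholder database
    .option("collection", "movies")         # placeholder collection
    .load()
)

# Write results back to MongoDB so they are accessible to applications.
(
    df.write.format("mongodb")
    .option("spark.mongodb.output.uri", connectionString)
    .option("database", "sample_mflix")     # placeholder database
    .option("collection", "movies_copy")    # placeholder target collection
    .mode("append")
    .save()
)
```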