You can use Databricks notebooks to download data from public URLs to the ephemeral volume storage attached to the driver of your cluster. If your data already resides in cloud object storage, reading it directly with Apache Spark provides better results.
Databricks clusters provide general compute, allowing you to run arbitrary code in addition to Apache Spark commands. Because arbitrary commands execute against the root directory for the cluster rather than the DBFS root, you must move downloaded data to a new location before reading it with Apache Spark.
Some workspace configurations might prevent access to the public internet. Consult your workspace administrator if you need expanded network access.
Databricks does not provide any native tools for downloading data from the internet, but you can use open source tools in supported languages. The following examples use packages for Bash, Python, and Scala to download the same file.
```bash
%sh curl https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv --output /tmp/curl-subway.csv
```
```python
import urllib.request

urllib.request.urlretrieve(
    "https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv",
    "/tmp/python-subway.csv",
)
```
```scala
import java.net.URL
import java.io.File
import org.apache.commons.io.FileUtils

FileUtils.copyURLToFile(
  new URL("https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv"),
  new File("/tmp/scala-subway.csv"))
```
Because these files are downloaded to the volume storage attached to the driver, use `%sh` to see them, as in the following example:
```bash
%sh ls /tmp/
```
You can use Bash commands to preview the contents of files downloaded this way, as in the following example:
```bash
%sh head /tmp/curl-subway.csv
```
To access this data with Apache Spark, you must move it out of its current location: ephemeral volume storage that is visible only to the driver. Databricks loads data from file sources in parallel, so files must be visible to all nodes in the compute environment. While Databricks supports a wide range of external data sources, file-based data access generally assumes access to cloud object storage.
The Databricks Utilities (`dbutils`) allow you to move files from volume storage attached to the driver to other locations accessible with DBFS, including external object storage locations you've configured access to. The following example moves data to a directory in the DBFS root, a cloud object storage location configured during initial workspace deployment.
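A minimal sketch of that move, run from a Python notebook cell. It assumes the file downloaded by the `curl` example above; the `dbfs:/tmp/` target directory and the final Spark read are illustrative choices, not requirements:

```python
# Move the file from driver-local storage (file:/) to the DBFS root (dbfs:/).
# Source path matches the curl example above; the target directory is an assumption.
dbutils.fs.mv("file:/tmp/curl-subway.csv", "dbfs:/tmp/subway.csv")

# Once the file is in DBFS-backed object storage, it is visible to all nodes
# and can be read in parallel with Apache Spark.
df = spark.read.format("csv").option("header", "true").load("dbfs:/tmp/subway.csv")
```

Note the `file:/` scheme on the source path: it tells `dbutils.fs` to resolve the path against the driver's local filesystem rather than the DBFS root.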