Accessing Data

This topic describes how to import data into Databricks, load data using the Spark API, and edit and delete data using Databricks File System (DBFS) commands.

Import data

If you have small files on your local machine that you want to analyze with Databricks, you can easily upload them to Databricks File System (DBFS). For simple exploration scenarios you can:

  • Drop files into, or browse to files in, the Import & Explore Data box on the landing page.

For production environments, however, we recommend that you access Databricks File System (DBFS) using the CLI or one of the APIs. You can also use a wide variety of Data Sources to import data directly in your notebooks.

Load data

You can read your raw data into Spark directly. For example, if you uploaded a CSV, you can read your data using one of these examples.

Tip

For easier access, we recommend that you create a table. See Databases and Tables for more information.

Scala
val sparkDF = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/FileStore/tables/state_income-9f7c5.csv")
Python
sparkDF = spark.read.format('csv').options(header='true', inferSchema='true').load('/FileStore/tables/state_income-9f7c5.csv')
R
sparkDF <- read.df(source = "csv", path = "/FileStore/tables/state_income-9f7c5.csv", header="true", inferSchema = "true")
Scala RDD
val rdd = sc.textFile("/FileStore/tables/state_income-9f7c5.csv")
Python RDD
rdd = sc.textFile("/FileStore/tables/state_income-9f7c5.csv")
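Unlike the DataFrame readers above, `sc.textFile` returns the raw lines of the file, so you must split out the header and parse the fields yourself. A minimal sketch of that parsing step in plain Python, with no Spark required and hypothetical column names, might look like:

```python
import csv
import io

# Sample lines such as rdd.collect() might return for a small CSV
# (the column names here are made up for illustration).
lines = [
    "state,income",
    "CA,71228",
    "TX,59570",
]

# Parse the header row, then turn each remaining line into a dict,
# mirroring what you would do inside an rdd.map(...) call.
reader = csv.reader(io.StringIO("\n".join(lines)))
header = next(reader)
rows = [dict(zip(header, fields)) for fields in reader]

print(rows[0])  # {'state': 'CA', 'income': '71228'}
```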

If the data volume is small enough, you can also load this data directly onto the driver node. For example:

Python
import pandas as pd
pandas_df = pd.read_csv("/dbfs/FileStore/tables/state_income-9f7c5.csv", header='infer')
R
df = read.csv("/dbfs/FileStore/tables/state_income-9f7c5.csv", header = TRUE)
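Note the /dbfs prefix in these paths: a DBFS path of the form dbfs:/... is exposed to local file APIs on the driver under /dbfs/.... Databricks performs this mapping transparently through a FUSE mount, but a small hypothetical helper makes the convention explicit:

```python
def dbfs_to_local(path: str) -> str:
    """Map a dbfs:/ URI to the local /dbfs mount on the driver.

    Hypothetical helper for illustration only; on a real cluster the
    /dbfs mount provides this mapping automatically.
    """
    if path.startswith("dbfs:/"):
        return "/dbfs/" + path[len("dbfs:/"):]
    return path

print(dbfs_to_local("dbfs:/FileStore/tables/state_income-9f7c5.csv"))
# /dbfs/FileStore/tables/state_income-9f7c5.csv
```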

Download to driver

You can use %sh wget <url>/<filename> to download data to the Spark driver node.

Note

The cell output prints Saving to: '<filename>', but the file is actually saved to file:/databricks/driver/<filename>.
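Because wget writes to the driver's local disk, the file is lost when the cluster terminates; you may want to copy it into DBFS afterward. The same idea sketched in plain Python, using temporary directories as stand-ins for the driver disk and the /dbfs mount:

```python
import shutil
import tempfile
from pathlib import Path

# Stand-in directories for the driver's local disk and the DBFS mount;
# on a real cluster these would be /databricks/driver and /dbfs/FileStore.
driver_dir = Path(tempfile.mkdtemp())
dbfs_dir = Path(tempfile.mkdtemp())

# Simulate a file downloaded with %sh wget.
downloaded = driver_dir / "data.csv"
downloaded.write_text("state,income\nCA,71228\n")

# Copy it to the (simulated) DBFS mount so it persists.
dest = shutil.copy(downloaded, dbfs_dir / "data.csv")
print(Path(dest).read_text())
```

In a notebook, the equivalent persistent copy is typically done with dbutils.fs.cp from the file:/ path to a dbfs:/ path.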

Edit data

You cannot edit data directly within Databricks, but you can overwrite a data file using Databricks File System (DBFS) commands.
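In practice, "overwriting" means rewriting the whole file: read it, transform the rows, and write the result back to the same path. A sketch of that pattern with plain Python file APIs, using a temporary file as a stand-in for a path under the /dbfs mount and a hypothetical state column:

```python
import csv
import tempfile
from pathlib import Path

# Stand-in for a file under /dbfs/FileStore/tables/.
path = Path(tempfile.mkdtemp()) / "state_income.csv"
path.write_text("state,income\nCA,71228\nTX,59570\n")

# Read all rows, apply an edit (lowercase the hypothetical 'state'
# column), then overwrite the file in one pass.
with path.open(newline="") as f:
    rows = list(csv.DictReader(f))
for row in rows:
    row["state"] = row["state"].lower()

with path.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["state", "income"])
    writer.writeheader()
    writer.writerows(rows)

print(path.read_text())
```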

Delete data

To delete data, use the following Databricks Utilities command:

dbutils.fs.rm("dbfs:/FileStore/tables/state_income-9f7c5.csv", true)

Warning

Deleted data cannot be recovered.