This topic describes how to import data, load data using the Spark API, and edit and delete data using Databricks File System (DBFS) commands.
If you have small files on your local machine that you want to analyze with Databricks, you can upload them to the Databricks File System (DBFS). For simple exploration scenarios you can:
- Drop files into, or browse to files in, the Import & Explore Data box on the landing page.
- Upload the files in the Create table UI.
For production environments, however, we recommend that you access the Databricks File System (DBFS) using the CLI or one of the APIs. You can also use a wide variety of Data Sources to import data directly in your notebooks.
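As one illustration of API access, the following Python sketch builds a request for the DBFS REST API put endpoint (/api/2.0/dbfs/put), which accepts the file body as base64-encoded text. The workspace URL, the token placeholder, the sample CSV bytes, and the helper name build_dbfs_put_request are all assumptions made for this example. The request is only constructed here, not sent.

```python
import base64
import json
import urllib.request


def build_dbfs_put_request(host, token, dbfs_path, data, overwrite=False):
    """Build (but do not send) a request for the DBFS REST API put endpoint."""
    payload = {
        "path": dbfs_path,
        # The endpoint expects the file contents as base64-encoded text.
        "contents": base64.b64encode(data).decode("ascii"),
        "overwrite": overwrite,
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/dbfs/put",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical workspace URL, token, and sample data; replace with your own.
request = build_dbfs_put_request(
    host="https://example.cloud.databricks.com",
    token="<personal-access-token>",
    dbfs_path="/FileStore/tables/state_income-9f7c5.csv",
    data=b"state,income\nCA,100\n",
)
# To perform the upload, pass the request to urllib.request.urlopen(request).
```

Sending the contents inline like this is best suited to small files; check the API reference for size limits before relying on it for larger uploads.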
You can read your raw data into Spark directly. For example, if you uploaded a CSV file, you can read it using one of the following examples.
For easier access, we recommend that you create a table. See Databases and Tables for more information.
- Scala
val sparkDF = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/FileStore/tables/state_income-9f7c5.csv")
- Python
sparkDF = spark.read.format('csv').options(header='true', inferSchema='true').load('/FileStore/tables/state_income-9f7c5.csv')
- R
sparkDF <- read.df(source = "csv", path = "/FileStore/tables/state_income-9f7c5.csv", header = "true", inferSchema = "true")
- Scala RDD
val rdd = sc.textFile("/FileStore/tables/state_income-9f7c5.csv")
- Python RDD
rdd = sc.textFile("/FileStore/tables/state_income-9f7c5.csv")
If the data volume is small enough, you can also load this data directly onto the driver node. For example:
- Python
import pandas as pd
pandas_df = pd.read_csv("/dbfs/FileStore/tables/state_income-9f7c5.csv", header='infer')
- R
df <- read.csv("/dbfs/FileStore/tables/state_income-9f7c5.csv", header = TRUE)
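The driver-side examples use a /dbfs prefix because local file APIs on the driver reach DBFS through a mount at /dbfs, whereas the Spark examples use the DBFS path directly. As a minimal sketch of the same idea without pandas, the following reads a small CSV file with Python's standard csv module. The helper name load_rows is hypothetical, and the code assumes the file fits comfortably in driver memory.

```python
import csv


def load_rows(path):
    """Read a small CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


# On Databricks, pass the /dbfs-prefixed path, for example:
# rows = load_rows("/dbfs/FileStore/tables/state_income-9f7c5.csv")
```

All values come back as strings, so convert numeric columns explicitly if you need them as numbers.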
You can use %sh wget <url>/<filename> to download data to the Spark driver node.
The cell output prints Saving to: '<filename>', but the file is actually saved to
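Because the name printed by wget does not show the full location, it can help to download from Python and have the absolute path returned explicitly. The following is a sketch under that assumption, using only the standard library. The helper name download_to_driver is hypothetical, and the code assumes the URL ends in a file name.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def download_to_driver(url, dest_dir="."):
    """Download url into dest_dir on the local file system; return the absolute path."""
    # Use the last path segment of the URL as the local file name.
    filename = os.path.basename(urlparse(url).path)
    dest_path = os.path.abspath(os.path.join(dest_dir, filename))
    # Stream the response to disk instead of holding it all in memory.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


# Example call, keeping the placeholders used in the text:
# saved_path = download_to_driver("<url>/<filename>")
# print(saved_path)
```

The returned path is on the driver's local file system, so the file is not visible to Spark executors until it is copied into DBFS.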
You cannot edit data directly within Databricks, but you can overwrite a data file using Databricks File System (DBFS) commands.
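One way to picture the overwrite workflow: prepare the corrected file elsewhere, then copy it over the existing one. The sketch below does this with the standard library and assumes the target is reachable through the /dbfs mount on the driver. The helper name overwrite_file and the corrected-file path in the usage comment are hypothetical.

```python
import shutil


def overwrite_file(new_version_path, target_path):
    """Replace the contents of target_path with those of new_version_path."""
    # shutil.copyfile overwrites the destination if it already exists.
    shutil.copyfile(new_version_path, target_path)
    return target_path


# Example call on the driver (the source path is an assumed local file):
# overwrite_file("/tmp/state_income_fixed.csv",
#                "/dbfs/FileStore/tables/state_income-9f7c5.csv")
```

In a notebook, the Databricks Utilities call dbutils.fs.put also accepts an overwrite argument, which may be more convenient when the new contents are a short string.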
To delete data, use the following Databricks Utilities command:
Deleted data cannot be recovered.
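As a sketch of what a deletion call can look like, the following wraps the rm function of Databricks Utilities, which takes a path and a recurse flag. The wrapper name delete_from_dbfs is hypothetical, and the file path in the usage comment is borrowed from the earlier examples purely for illustration. The file system object is passed in as a parameter so the function can also be exercised outside a notebook.

```python
def delete_from_dbfs(fs, path, recurse=False):
    """Delete path through a dbutils.fs-like object and return its result."""
    # Deletion is permanent, so print the target first as a last visual check.
    print(f"Deleting {path} (recurse={recurse})")
    return fs.rm(path, recurse)


# In a Databricks notebook, dbutils is predefined:
# delete_from_dbfs(dbutils.fs, "dbfs:/FileStore/tables/state_income-9f7c5.csv")
```

Leaving recurse at its default of False limits the call to a single file or an empty directory, which reduces the chance of removing more than intended.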