This topic describes how to import data into Databricks, load it using the Spark APIs, and edit or delete it using Databricks File System (DBFS) commands.
If you have small files on your local machine that you want to analyze with Databricks, you can easily upload them to Databricks File System. For simple exploration scenarios you can:

- Drop files into, or browse to files in, the Import & Explore Data box on the landing page.
- Upload the files in the Create table UI.
For production environments, however, we recommend that you access Databricks File System using the CLI or one of the APIs. You can also use a wide variety of data sources to import data directly in your notebooks.
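One programmatic route is the DBFS REST API's `/api/2.0/dbfs/put` endpoint, which accepts base64-encoded file contents (inline uploads are limited to roughly 1 MB). A minimal sketch using only the standard library; the workspace URL and access token are placeholders you must supply:

```python
# Sketch: upload a small local file to DBFS via the DBFS REST API
# (POST /api/2.0/dbfs/put). Host and token are placeholders.
import base64
import json
import urllib.request

def dbfs_put(host, token, local_path, dbfs_path, overwrite=True):
    """Upload local_path to dbfs_path in the given workspace."""
    with open(local_path, "rb") as f:
        # The API expects the file contents base64-encoded.
        contents = base64.b64encode(f.read()).decode("ascii")
    body = json.dumps({
        "path": dbfs_path,
        "contents": contents,
        "overwrite": overwrite,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/api/2.0/dbfs/put",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example call (placeholders):
# dbfs_put("https://<workspace-url>", "<personal-access-token>",
#          "state_income.csv", "/FileStore/tables/state_income.csv")
```

For files larger than the inline limit, the API also offers `create`/`add-block`/`close` endpoints for streaming uploads.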
You can read your raw data into Spark directly. For example, if you uploaded a CSV, you can read your data using one of these examples.
For easier access, we recommend that you create a table. See Databases and Tables for more information.
- Scala

val sparkDF = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/FileStore/tables/state_income-9f7c5.csv")

- Python

sparkDF = spark.read.format('csv').options(header='true', inferSchema='true').load('/FileStore/tables/state_income-9f7c5.csv')

- R

sparkDF <- read.df(source = "csv", path = "/FileStore/tables/state_income-9f7c5.csv", header = "true", inferSchema = "true")
- Scala RDD
val rdd = sc.textFile("/FileStore/tables/state_income-9f7c5.csv")
- Python RDD
rdd = sc.textFile("/FileStore/tables/state_income-9f7c5.csv")
If the data volume is small enough, you can also load this data directly onto the driver node. For example:
- Python (pandas)

import pandas as pd

pandas_df = pd.read_csv("/dbfs/FileStore/tables/state_income-9f7c5.csv", header='infer')

- R

df <- read.csv("/dbfs/FileStore/tables/state_income-9f7c5.csv", header = TRUE)
You can use %sh wget <url>/<filename> to download data to the Spark driver node. The cell output prints Saving to: '<filename>', but the file is actually saved to the local filesystem of the driver node, not to DBFS.
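Because the downloaded file lives only on the driver's local disk, you may want to move it into DBFS so it persists and is visible to all cluster nodes. A guarded sketch; the paths are illustrative, and `dbutils` is predefined only inside a Databricks notebook, so the guard lets the snippet run anywhere:

```python
# Move a file fetched with %sh wget from the driver's local disk
# into DBFS. `dbutils` exists only in Databricks notebooks, so the
# NameError guard makes this snippet safe to run elsewhere.
# Source and destination paths are illustrative.
try:
    dbutils.fs.mv("file:/databricks/driver/data.csv",
                  "dbfs:/FileStore/tables/data.csv")
    moved = True
except NameError:
    moved = False  # not running inside a Databricks notebook
```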
You cannot edit data directly within Databricks, but you can overwrite a data file using Databricks File System commands.
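For example, dbutils.fs.put can overwrite a small file in place when its third argument is True. A guarded sketch; the path and contents are illustrative, and `dbutils` is defined only inside a Databricks notebook:

```python
# Overwrite a small DBFS file in place with dbutils.fs.put; the
# third argument (True) permits overwriting an existing file.
# Guarded so the snippet can run outside a Databricks notebook,
# where `dbutils` is undefined. Path and contents are illustrative.
try:
    dbutils.fs.put("/FileStore/tables/notes.txt", "updated contents", True)
    overwritten = True
except NameError:
    overwritten = False  # not running inside a Databricks notebook
```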
To delete data, use the Databricks Utilities dbutils.fs.rm command.
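A guarded sketch of the call; pass True as the second argument to delete a directory and its contents recursively, and note that `dbutils` exists only inside a Databricks notebook:

```python
# Permanently delete a file from DBFS with dbutils.fs.rm.
# Pass True as a second argument to remove a directory recursively.
# Guarded because `dbutils` is defined only in Databricks notebooks.
try:
    dbutils.fs.rm("/FileStore/tables/state_income-9f7c5.csv")
    deleted = True
except NameError:
    deleted = False  # not running inside a Databricks notebook
```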
Deleted data cannot be recovered.