read.df

read.df reads a dataset from a data source into a SparkR DataFrame.

Syntax:

  • read.df(sqlContext, "path", "source", schema, ...)

Parameters:

  • sqlContext: SQLContext. This is already created for you in Databricks notebooks; do not recreate it.
  • path: String, file path
  • source: String, data source format, e.g. "json", "parquet", or Spark packages like "com.databricks.spark.csv"
  • schema: structType, optional. If none is specified, Spark SQL infers the schema automatically

Output:

  • SparkR DataFrame

Guide: http://spark.apache.org/docs/latest/sparkr.html

Use read.df to Read JSON Files

Create a simple JSON file and read it with read.df.

%fs rm /tmp/test.json
%fs put /tmp/test.json "{\"string\":\"string1\",\"int\":1}
{\"string\":\"string2\",\"int\":2}
{\"string\":\"string3\",\"int\":3}
"
# Read JSON file as SparkR DataFrame
jsonData <- read.df(sqlContext, "/tmp/test.json", source = "json")
head(jsonData)
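
Since no schema was passed, Spark SQL inferred one from the JSON records. As a quick check, printSchema (a standard SparkR function) displays the inferred column names and types:

# Inspect the schema Spark SQL inferred from the JSON records
printSchema(jsonData)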

Reading CSV Files

# Read CSV file as SparkR DataFrame, using spark-csv package
diamonds <- read.df(sqlContext, "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
                    source = "com.databricks.spark.csv", header = "true", inferSchema = "true")
head(diamonds)
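
Because inferSchema is set to "true", the spark-csv package scans the data and assigns column types instead of defaulting everything to string. One quick way to confirm this, using the standard SparkR dtypes function:

# dtypes lists each column with its inferred type, e.g. "carat" paired with "double"
dtypes(diamonds)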

Let’s try loading a CSV file with a specified schema.

csvSchema <- structType(structField("carat", "double"), structField("color", "string"))
csvSchema
diamondsLoadWithSchema <- read.df(sqlContext, "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
                                  source = "com.databricks.spark.csv", header = "true", schema = csvSchema)
head(diamondsLoadWithSchema)
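
To verify that the DataFrame picked up the two fields declared in csvSchema rather than an inferred schema, compare the printed schema against the csvSchema output above (assuming the spark-csv package applies csvSchema as given):

# The printed schema should match csvSchema: carat (double) and color (string)
printSchema(diamondsLoadWithSchema)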

Reading Parquet Files

Save the diamonds DataFrame as a Parquet file.

# Write the diamonds DataFrame out in Parquet format
saveAsParquetFile(diamonds, "/tmp/diamonds.parquet")
# Use read.df to read in Parquet file
parquetDiamonds <- read.df(sqlContext, "/tmp/diamonds.parquet", source = "parquet")
head(parquetDiamonds)
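
Once loaded, the Parquet-backed DataFrame behaves like any other SparkR DataFrame, so it can be queried with SQL. A minimal sketch using the Spark 1.x SparkR API; the temporary table name parquetDiamondsTable is arbitrary:

# Register the DataFrame as a temporary table and query it with SQL
registerTempTable(parquetDiamonds, "parquetDiamondsTable")
expensiveDiamonds <- sql(sqlContext, "SELECT carat, price FROM parquetDiamondsTable WHERE price > 5000")
head(expensiveDiamonds)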