read.df

read.df reads a dataset from a data source into a SparkDataFrame.

Syntax:

  • read.df("path", "source", schema, ...)

Parameters:

  • path: String, file path
  • source: String, the data source format, e.g. "json", "parquet", or a Spark package such as "com.databricks.spark.csv"
  • schema: structType, optional. If not specified, Spark SQL infers the schema automatically.

Output:

  • SparkDataFrame
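
As a quick, minimal sketch of the call pattern (the path below is a placeholder, and an active SparkR session is assumed, as on Databricks):

require(SparkR)

# Read a JSON dataset; with no schema argument, Spark SQL infers one
df <- read.df("/tmp/example.json", source = "json")
head(df)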

Guide: http://spark.apache.org/docs/latest/sparkr.html

Use read.df to Read JSON Files

Create a simple JSON file and read it with read.df.

%fs rm /tmp/test.json
%fs put /tmp/test.json "{\"string\":\"string1\",\"int\":1}
{\"string\":\"string2\",\"int\":2}
{\"string\":\"string3\",\"int\":3}
"
require(SparkR)

# Read JSON file as SparkDataFrame
jsonData <- read.df("/tmp/test.json", source = "json")
head(jsonData)
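
Because no schema was supplied, Spark SQL infers one from the JSON records. You can inspect the result with printSchema (note that whole numbers in JSON are typically inferred as bigint):

# Show the schema Spark SQL inferred for the JSON data
printSchema(jsonData)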

Reading CSV Files

require(SparkR)

# Read CSV file as SparkDataFrame
diamonds <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
                    source = "csv", header = "true", inferSchema = "true")
head(diamonds)
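
With inferSchema = "true", Spark scans the data and assigns column types, so numeric columns such as carat and price come back as numbers rather than strings. A minimal check:

# Confirm that column types were inferred rather than left as strings
printSchema(diamonds)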

Let’s try loading a CSV file with a specified schema.

# Define a two-column schema: carat (double) and color (string)
csvSchema <- structType(structField("carat", "double"), structField("color", "string"))
csvSchema
diamondsLoadWithSchema <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
                                  source = "csv", header = "true", schema = csvSchema)
head(diamondsLoadWithSchema)
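
When an explicit schema is passed, Spark SQL uses it as-is instead of inferring one, so the loaded SparkDataFrame carries exactly the declared columns and types. One way to confirm, using the standard SparkR schema accessor:

# The declared structType is attached to the loaded SparkDataFrame
schema(diamondsLoadWithSchema)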

Reading Parquet Files

Save the diamonds SparkDataFrame as a Parquet file.

require(SparkR)

# Write the SparkDataFrame out as Parquet
# (write.parquet supersedes the deprecated saveAsParquetFile)
write.parquet(diamonds, "/tmp/diamonds.parquet")
# Use read.df to read in Parquet file
parquetDiamonds <- read.df("/tmp/diamonds.parquet", source = "parquet")
head(parquetDiamonds)
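
Parquet is a self-describing format: the schema is stored alongside the data, so no inference is needed on read and the column types from the original diamonds SparkDataFrame survive the round trip:

# The schema written with the Parquet data is recovered on read
printSchema(parquetDiamonds)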