CSV file
Examples
These examples use the diamonds dataset. Specify the path to the dataset as well as any options that you would like.
Read file in any language
This notebook shows how to read a file, display sample data, and print the data schema using Scala, R, Python, and SQL.
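As a rough Scala sketch of those three steps (read, display, print schema), the snippet below can be pasted into a notebook cell or `spark-shell`. It makes two assumptions: the temporary file and its three columns are a hypothetical stand-in for the diamonds dataset, and the local `SparkSession` is only needed outside a notebook, where `spark` is not already defined.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in a notebook, getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("csv-read").getOrCreate()

// Hypothetical stand-in for the diamonds dataset: a header plus two rows.
val file = Files.createTempFile("diamonds-sample", ".csv")
Files.write(file, "carat,cut,price\n0.23,Ideal,326\n0.21,Premium,326\n".getBytes(StandardCharsets.UTF_8))

val diamonds = spark.read
  .format("csv")
  .option("header", "true")       // the first line holds the column names
  .option("inferSchema", "true")  // let Spark guess the column types
  .load(file.toUri.toString)

diamonds.show(5)        // display sample data
diamonds.printSchema()  // print the data schema
```

To run this against the real dataset, replace `file.toUri.toString` with the path to your copy of the diamonds CSV.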
Specify schema
When the schema of the CSV file is known, you can specify the desired schema to the CSV reader with the schema option.
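A minimal sketch of passing an explicit schema, using the reader's `schema(...)` method. The column names and types here are illustrative assumptions about a cut-down diamonds file, not the full dataset layout, and the temporary file is hypothetical sample data.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

// Assumption: a local session; in a notebook, getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("csv-schema").getOrCreate()

// Hypothetical sample data standing in for the diamonds dataset.
val file = Files.createTempFile("diamonds-schema", ".csv")
Files.write(file, "carat,cut,price\n0.23,Ideal,326\n0.21,Premium,326\n".getBytes(StandardCharsets.UTF_8))

// The schema the reader should apply instead of inferring one.
val diamondsSchema = StructType(Seq(
  StructField("carat", DoubleType),
  StructField("cut", StringType),
  StructField("price", IntegerType)
))

val typedDiamonds = spark.read
  .format("csv")
  .option("header", "true")  // skip the header line rather than parsing it as data
  .schema(diamondsSchema)
  .load(file.toUri.toString)

typedDiamonds.printSchema()
```

Supplying the schema also avoids the extra pass over the data that schema inference requires.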
Verify correctness of the data
When reading CSV files with a specified schema, the data in the files might not match the schema. For example, a field containing the name of a city will not parse as an integer. The consequences depend on the mode that the parser runs in:
PERMISSIVE (default): nulls are inserted for fields that could not be parsed correctly
DROPMALFORMED: drops lines that contain fields that could not be parsed
FAILFAST: aborts the reading if any malformed data is found
To set the mode, use the mode option.
val diamonds_with_wrong_schema_permissive = sqlContext.read.format("csv").option("mode", "PERMISSIVE")
In PERMISSIVE mode, you can inspect the rows that could not be parsed correctly. To do that, add a _corrupt_record column to the schema.
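The following sketch shows one way to surface unparseable rows through a `_corrupt_record` column. The sample file is hypothetical, and its second row deliberately has a price that cannot be parsed as an integer. The `cache()` call is an assumption about your Spark version: it works around a restriction on queries that reference only the corrupt-record column of a raw CSV source.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

// Assumption: a local session; in a notebook, getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("csv-corrupt").getOrCreate()

// Hypothetical sample data: the last row's price is not an integer.
val file = Files.createTempFile("diamonds-corrupt", ".csv")
Files.write(file, "carat,cut,price\n0.23,Ideal,326\n0.21,Premium,not_a_number\n".getBytes(StandardCharsets.UTF_8))

// The extra string column receives the raw text of rows that fail to parse.
val schemaWithCorrupt = StructType(Seq(
  StructField("carat", DoubleType),
  StructField("cut", StringType),
  StructField("price", IntegerType),
  StructField("_corrupt_record", StringType)
))

val parsed = spark.read
  .format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .schema(schemaWithCorrupt)
  .load(file.toUri.toString)
  .cache()  // materialize the parse result before filtering on _corrupt_record

// Keep only the rows that could not be parsed against the schema.
val badRows = parsed.filter(parsed("_corrupt_record").isNotNull)
badRows.show(truncate = false)
```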
Pitfalls of reading a subset of columns
The behavior of the CSV parser depends on the set of columns that are read. If the specified schema is incorrect, the results might differ considerably depending on the subset of columns that is accessed. The following notebook presents the most common pitfalls.
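One such pitfall can be sketched as follows, under stated assumptions: the sample file is hypothetical, the mode is DROPMALFORMED, and the outcome depends on the Spark version and on the `spark.sql.csv.parser.columnPruning.enabled` setting. When column pruning is active, a row may count as malformed only if one of the columns actually being read fails to parse, so the same DataFrame can return a different number of rows depending on which columns are selected.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

// Assumption: a local session; in a notebook, getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("csv-column-subset").getOrCreate()

// Hypothetical sample data: the last row's price does not match the schema.
val file = Files.createTempFile("diamonds-subset", ".csv")
Files.write(file, "carat,cut,price\n0.23,Ideal,326\n0.21,Premium,not_a_number\n".getBytes(StandardCharsets.UTF_8))

val schema = StructType(Seq(
  StructField("carat", DoubleType),
  StructField("cut", StringType),
  StructField("price", IntegerType)
))

val dropped = spark.read
  .format("csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .schema(schema)
  .load(file.toUri.toString)

// Read different column subsets and compare how many rows survive.
val caratRows = dropped.select("carat").collect().length
val priceRows = dropped.select("price").collect().length

println(s"rows when reading carat only: $caratRows")
println(s"rows when reading price only: $priceRows")
```

If the two numbers differ on your cluster, that is a sign the specified schema does not match the data.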