CSV Files

Supported options

Reading files

  • path: location of files. Accepts standard Hadoop globbing expressions. To read a directory of CSV files, specify a directory.
  • header: when set to true, the first line of files is used to name columns and is not included in the data. All types are assumed to be string. Default value is false.
  • delimiter: the column delimiter. By default this is , (comma), but it can be set to any character.
  • quote: by default the quote character is ", but can be set to any character. Delimiters inside quotes are ignored.
  • escape: by default the escape character is \, but can be set to any character. Escaped quote characters are ignored.
  • parserLib: the parsing library. The default is commons; set it to univocity to use that library for CSV parsing.
  • mode: the parsing mode. By default it is PERMISSIVE. Possible values are:
    • PERMISSIVE: tries to parse all lines: nulls are inserted for missing tokens and extra tokens are ignored.
    • DROPMALFORMED: drops lines that have fewer or more tokens than expected or tokens which do not match the schema.
    • FAILFAST: aborts with a RuntimeException if any malformed line is encountered.
  • charset: defaults to UTF-8 but can be set to other valid charset names.
  • inferSchema: automatically infer column types. It requires one extra pass over the data and is false by default.
  • comment: skip lines beginning with this character. Default is #. Disable comments by setting this to null.
  • nullValue: a string that indicates a null value; any fields matching this string are set to null in the DataFrame.
  • dateFormat: a string that indicates the date format to use when reading dates or timestamps. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to both DateType and TimestampType. By default it is null, which means dates and timestamps are parsed with java.sql.Timestamp.valueOf() and java.sql.Date.valueOf().
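
As an illustration, a minimal Scala read combining several of these options might look like the following sketch (the delimiter, null marker, and path are hypothetical):

val df = sqlContext.read.format("csv")
  .option("header", "true")       // use the first line for column names
  .option("delimiter", ";")       // hypothetical: semicolon-delimited input
  .option("nullValue", "NA")      // hypothetical: treat the string NA as null
  .load("/path/to/csv-directory") // hypothetical path; Hadoop globs also work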

Writing files

The package also supports saving simple (non-nested) DataFrames. When writing files, the API accepts the following options:

  • path: location of files.
  • header: when set to true, the header (from the schema in the DataFrame) is written on the first line.
  • delimiter: the column delimiter. By default this is , (comma), but it can be set to any character.
  • quote: by default the quote character is ", but can be set to any character. This is written according to quoteMode.
  • escape: by default the escape character is \, but can be set to any character. Escaped quote characters are written.
  • nullValue: a string that indicates a null value; nulls in the DataFrame are written as this string.
  • dateFormat: a string that indicates the date format to use when writing dates or timestamps. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to both DateType and TimestampType. If no dateFormat is specified, the default is yyyy-MM-dd HH:mm:ss.S.
  • codec: compression codec to use when saving. Should be the fully qualified name of a class implementing org.apache.hadoop.io.compress.CompressionCodec or one of the case-insensitive short names (bzip2, gzip, lz4, and snappy). Defaults to no compression.
  • quoteMode: when to quote fields (ALL, MINIMAL (default), NON_NUMERIC, NONE), see Quote Modes.
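
A corresponding write might look like this sketch (the output path is hypothetical):

df.write.format("csv")
  .option("header", "true")   // write the column names on the first line
  .option("codec", "gzip")    // one of the short names listed above
  .option("nullValue", "NA")  // hypothetical: write nulls as the string NA
  .save("/path/to/output")    // hypothetical output directory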

Examples

These examples use the diamonds dataset available as a Databricks dataset. All you have to do is specify the path as well as any options that you would like.

Read file in any language

This notebook shows how to read a file, display sample data, and print the data schema using Scala, R, Python, and SQL.

Read CSV files notebook example
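
In Scala, the steps the notebook walks through look roughly like this sketch (display is the Databricks notebook function; the path is assumed to be the standard location of the diamonds dataset):

val diamonds = sqlContext.read.format("csv")
  .option("header", "true")      // use the first line for column names
  .option("inferSchema", "true") // infer column types with one extra pass
  .load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")

display(diamonds)       // show sample data
diamonds.printSchema()  // print the data schema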

Specify schema

When the schema of the CSV file is known upfront, you can specify the desired schema to the CSV reader with the schema option.

Read CSV files with a user-specified schema
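
A Scala sketch of supplying a schema (the column names and types here are assumed for illustration and cover only part of the dataset):

import org.apache.spark.sql.types._

// Hypothetical, abbreviated schema for the diamonds dataset.
val diamondsSchema = StructType(Seq(
  StructField("carat", DoubleType, nullable = true),
  StructField("cut", StringType, nullable = true),
  StructField("price", IntegerType, nullable = true)
))

val diamondsWithSchema = sqlContext.read.format("csv")
  .option("header", "true")
  .schema(diamondsSchema) // skip inference and use the supplied schema
  .load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")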

Verify correctness of the data

When reading CSV files with a user-specified schema, it is possible that the actual data in the files does not match the specified schema. For example, a field containing the name of a city will not parse as an integer. The consequences depend on the mode that the parser runs in:

  • PERMISSIVE (default): nulls are inserted for fields that could not be parsed correctly
  • DROPMALFORMED: drops lines that contain fields that could not be parsed
  • FAILFAST: aborts the reading if any malformed data is found

To set the mode, use the mode option.

val diamonds_with_wrong_schema_drop_malformed = sqlContext.read.format("csv")
  .option("mode", "DROPMALFORMED")
  .schema(diamondsSchema) // the user-specified schema from above
  .load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")

In PERMISSIVE mode it is possible to inspect the rows that could not be parsed correctly. To do that, you can add a _corrupt_record column to the schema.
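
For example, a sketch that reuses the diamondsSchema defined above and appends the corrupt-record column:

// Append a string column named _corrupt_record; in PERMISSIVE mode,
// rows that fail to parse are stored there in their raw form.
val schemaWithCorrupt =
  StructType(diamondsSchema.fields :+ StructField("_corrupt_record", StringType, nullable = true))

val parsed = sqlContext.read.format("csv")
  .option("mode", "PERMISSIVE")
  .option("header", "true")
  .schema(schemaWithCorrupt)
  .load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")

// Inspect only the rows that could not be parsed.
parsed.filter("_corrupt_record is not null").show()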

Pitfalls of reading a subset of columns

The behavior of the CSV parser depends on the set of columns that are read. If the user-specified schema is incorrect, the results might differ considerably depending on the subset of columns that is accessed. The notebook below presents the most common pitfalls.

Caveats of reading a subset of columns of a CSV file
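
As a hedged illustration of the pitfall: because the parser may validate only the columns a query actually reads, two queries over the same DROPMALFORMED DataFrame can disagree on the row count.

val dfDrop = sqlContext.read.format("csv")
  .option("mode", "DROPMALFORMED")
  .option("header", "true")
  .schema(diamondsSchema) // the (possibly wrong) user-specified schema from above
  .load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")

// These counts can differ if malformed values occur only in columns
// that the first query never touches.
dfDrop.select("cut").count()
dfDrop.count()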