
Work with ORC files

Apache ORC is a columnar file format with built-in compression and lightweight indexes that speed up analytical queries, making it typically more compact and faster to scan than row-based formats such as CSV or JSON. Databricks supports ORC for both reading and writing with Apache Spark. For more information, see the Apache Spark documentation on ORC Files.

Prerequisites

Databricks does not require additional configuration to read or write ORC files. To ingest ORC files incrementally as a stream, use Auto Loader.
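As an illustration of the streaming path, the sketch below uses Auto Loader's cloudFiles source with cloudFiles.format set to orc. This runs only on Databricks, and every path and the target_table name are placeholders for this example, not values from this article.

```python
# Databricks-only sketch: Auto Loader incrementally ingests new ORC files.
# All paths and the table name below are placeholder assumptions.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "orc")
    .option("cloudFiles.schemaLocation", "/tmp/orc_schema_location")
    .load("/tmp/orc_landing_zone")
    .writeStream
    .option("checkpointLocation", "/tmp/orc_checkpoint")
    .trigger(availableNow=True)
    .toTable("target_table"))
```

The availableNow trigger processes all files that have arrived so far and then stops, which is useful for scheduled batch-style ingestion.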

Configure and use ORC with the DataFrame API

Use the Apache Spark DataFrame API to read and write ORC files when you need full control over schema, partitioning, or write behavior.

Read and write options

See the Apache Spark reference documentation for the supported DataFrame API read and write options.

Read and write ORC files

For example, read data.orc into a DataFrame df and write it to orc_output.

Python
# Read an ORC file into a DataFrame
df = spark.read.format("orc").load("/tmp/data.orc")
df.show()

# Write a DataFrame to ORC format
df.write.format("orc").save("/tmp/orc_output")

# Write with overwrite mode
df.write.format("orc").mode("overwrite").save("/tmp/orc_output")

Read ORC files with schema specification

Specify a schema when reading ORC files to avoid the overhead of schema inference. For example, define a schema with name, age, and city fields and read data.orc into a DataFrame df.

Python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])

df = spark.read.format("orc").schema(schema).load("/tmp/data.orc")
df.printSchema()
df.show()

Write partitioned ORC files

Write partitioned ORC files for optimized query performance on large datasets. For example, create a DataFrame df with year, month, name, and amount columns and write it to partitioned_orc partitioned by year and month.

Python
df = spark.createDataFrame(
    [
        (2023, 1, "Alice", 100),
        (2023, 1, "Bob", 200),
        (2023, 2, "Alice", 150),
        (2024, 1, "Alice", 300),
    ],
    ["year", "month", "name", "amount"]
)

# Write partitioned by year and month
df.write.format("orc").partitionBy("year", "month").save("/tmp/partitioned_orc")

Read ORC files using SQL

Use read_files to query ORC files directly from cloud storage using SQL without creating a table. For example, query an ORC file stored in cloud storage using the path to the file and the orc format specifier.

SQL
SELECT * FROM read_files(
  's3://<bucket>/<path>/<file>.orc',
  format => 'orc'
)

Set ORC compression

Configure ORC compression using the compression option. Supported codecs include none, snappy (the default), zlib, lzo, zstd, and lz4. For example, write df to compressed_orc using zlib compression, or to snappy_orc using snappy compression.

Python
# Write with zlib compression
df.write.format("orc").option("compression", "zlib").save("/tmp/compressed_orc")

# Write with snappy compression (default)
df.write.format("orc").option("compression", "snappy").save("/tmp/snappy_orc")