Expand and read Zip compressed files

You can use the unzip Bash command to expand Zip compressed files or directories of files. If you download or encounter a file or directory ending with .zip, expand the data before continuing.
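A file extension is only a hint, so it can be useful to confirm that a file really is a Zip archive and see what it contains before expanding it. The following is a minimal sketch using Python's standard-library zipfile module; the helper name list_zip_members is hypothetical and not part of any Databricks API.

```python
import zipfile


def list_zip_members(path):
    """Return the member names of a Zip archive, or None if `path` is not one."""
    # is_zipfile inspects the file's contents, not its extension.
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

Calling list_zip_members on a downloaded file returns None when the file is missing or is not a Zip archive, which is a cheap check to run before attempting to expand it.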

Apache Spark provides native codecs for interacting with compressed Parquet files. Most Parquet files written by Databricks end with .snappy.parquet, indicating they use snappy compression.

How to unzip data

The Databricks %sh magic command enables execution of arbitrary Bash code, including the unzip command.
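If you prefer to stay in a Python cell instead of using %sh, the standard-library zipfile module can perform the same expansion. The following is a minimal sketch, not a Databricks-specific API: the helper name expand_zip and its arguments are hypothetical, and it assumes the archive is on local driver storage.

```python
import zipfile
from pathlib import Path


def expand_zip(archive_path, target_dir):
    """Extract every member of a Zip archive into target_dir.

    Returns the paths of the archive's members under target_dir.
    """
    target = Path(target_dir)
    # Create the destination directory if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        names = archive.namelist()
    return [target / name for name in names]
```

Passing an explicit target directory makes the location of the expanded files predictable, instead of depending on the current working directory of the process.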

The following example uses a zipped CSV file downloaded from the internet. See Download data from the internet.

Use the Databricks Utilities to move files to the ephemeral storage attached to the driver before expanding them.

This code uses curl to download and then unzip to expand the data:

Bash
%sh curl https://resources.lendingclub.com/LoanStats3a.csv.zip --output /tmp/LoanStats3a.csv.zip
unzip /tmp/LoanStats3a.csv.zip -d /tmp

Use dbutils to move the expanded file to a Unity Catalog volume, as follows:

Python
dbutils.fs.mv("file:/tmp/LoanStats3a.csv", "/Volumes/my_catalog/my_schema/my_volume/LoanStats3a.csv")

In this example, the downloaded data has a comment in the first row and a header in the second. Now that you have expanded and moved the data, read it with the standard CSV options: skip the comment row and treat the next row as the header, as in the following example:

Python
df = spark.read.format("csv").option("skipRows", 1).option("header", True).load("/Volumes/my_catalog/my_schema/my_volume/LoanStats3a.csv")
display(df)
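To decide how many rows to skip for a different file, it can help to look at its first few lines before reading it with Spark. The following is a minimal sketch in plain Python; the helper name peek_lines is hypothetical, and it assumes the file is readable through a local path and is UTF-8 text.

```python
from itertools import islice


def peek_lines(path, count=3):
    """Return the first `count` lines of a text file without line endings."""
    # errors="replace" keeps the preview from failing on stray bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\r\n") for line in islice(handle, count)]
```

If the first returned line is a comment and the second holds the column names, the skipRows value of 1 with header set to True matches the layout described above.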