Expand and read Zip compressed files

You can use the unzip Bash command to expand Zip (.zip) compressed files or directories of files. The Databricks %sh magic command enables execution of arbitrary Bash code, including the unzip command.

Apache Spark provides native codecs for interacting with compressed Parquet files, so those files do not need to be expanded before you read them. Most Parquet files written by Databricks end with .snappy.parquet, indicating they use Snappy compression.

Download and unzip the file

Use curl to download the compressed file and then unzip to expand the data. The following example uses a zipped CSV file downloaded from the internet. See Download data from the internet.

Bash
%sh
# Download the archive to /tmp, then extract it there (-d sets the output directory)
curl https://resources.lendingclub.com/LoanStats3a.csv.zip --output /tmp/LoanStats3a.csv.zip
unzip /tmp/LoanStats3a.csv.zip -d /tmp
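
If you prefer to stay in Python rather than shell out to unzip, the extraction step can be sketched with the standard-library zipfile module. This is a minimal sketch, not the documented method; `extract_zip` is a hypothetical helper name, and it assumes the archive already exists on local disk.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path: str, dest_dir: str) -> list[str]:
    """Extract every member of zip_path into dest_dir and return the member names.

    Hypothetical helper for illustration; not part of any Databricks API.
    """
    dest = Path(dest_dir)
    # Create the destination directory if it does not exist yet.
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

For the archive in this example, the call would be `extract_zip("/tmp/LoanStats3a.csv.zip", "/tmp")`. Returning the member names makes it easy to confirm which files the archive contained before moving them.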

Move the file to a volume

Now move the expanded file to a Unity Catalog volume:

Bash
%sh mv /tmp/LoanStats3a.csv /Volumes/my_catalog/my_schema/my_volume/LoanStats3a.csv

In this example, the downloaded data has a comment in the first row and a header in the second. Now that you have expanded and moved the data, read it with standard CSV options: skip the comment row and treat the next row as the header. For example:

Python
df = spark.read.format("csv").option("skipRows", 1).option("header", True).load("/Volumes/my_catalog/my_schema/my_volume/LoanStats3a.csv")
display(df)
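
Before choosing how many rows to skip, it can help to look at the first few lines of the expanded file. The following is a minimal sketch in plain Python; `preview_lines` is a hypothetical helper name, and it assumes the file is UTF-8 text readable from the driver.

```python
from itertools import islice


def preview_lines(path: str, count: int = 3) -> list[str]:
    """Return the first `count` lines of a text file, without trailing newlines.

    Hypothetical helper for illustration; not part of any Databricks API.
    """
    # errors="replace" keeps the preview from failing on stray non-UTF-8 bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, count)]
```

Calling `preview_lines("/Volumes/my_catalog/my_schema/my_volume/LoanStats3a.csv", 2)` would return the comment row and the header row, which is a quick way to check how many leading rows the reader needs to skip.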
