You can use the unzip Bash command to expand files or directories of files that have been Zip compressed. If you download or encounter a file or directory ending with .zip, expand the data before trying to continue.
Apache Spark provides native codecs for interacting with compressed Parquet files. By default, Parquet files written by Databricks end with .snappy.parquet, indicating they use snappy compression.
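Because the codecs are native, no decompression step or extra configuration is needed when reading such files. A minimal sketch, assuming an active Spark session named spark and a hypothetical path:

```python
# Sketch: snappy-compressed Parquet is read transparently; no codec options required.
# "/tmp/events" is a hypothetical path; `spark` is the session provided by the notebook.
df = spark.read.format("parquet").load("/tmp/events")
```

This only applies to Parquet's internal compression; a .zip wrapper around any file format still needs to be expanded first, as described above.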
The %sh magic command enables execution of arbitrary Bash code in a notebook, including the unzip command.
The following example uses a zipped CSV file downloaded from the internet. You can also use the Databricks Utilities to move files to the driver volume before expanding them. See Download data from the internet and Databricks Utilities.
The following code uses curl to download and then unzip to expand the data:
%sh curl https://resources.lendingclub.com/LoanStats3a.csv.zip --output /tmp/LoanStats3a.csv.zip
unzip /tmp/LoanStats3a.csv.zip
Use dbutils to move the expanded file back to cloud object storage to allow for parallel reading, as in the following:
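The move itself can be done with the Databricks Utilities file-system API. A minimal sketch; the source and destination paths are assumptions based on the download step above, so adjust them to wherever unzip actually placed the file:

```python
# Sketch: move the expanded CSV from local driver storage to DBFS-backed
# object storage so Spark executors can read it in parallel.
# Paths are assumptions; adjust to match your environment.
dbutils.fs.mv("file:/tmp/LoanStats3a.csv", "dbfs:/tmp/LoanStats3a.csv")
```

Note that dbutils is provided by the Databricks notebook runtime; this call will not run in a plain Python interpreter.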
In this example, the downloaded data has a comment in the first row and a header in the second. Now that the data has been expanded and moved, use standard options for reading CSV files, as in the following example:
df = spark.read.format("csv").option("skipRows", 1).option("header", True).load("/tmp/LoanStats3a.csv")
display(df)