hugging-face-dataset-download (Python)


Best practice for downloading datasets from Hugging Face to Databricks

This guide provides recommended best practices for using the Hugging Face load_dataset function to download and prepare datasets on Databricks for different data sizes.

  • Small-Medium (~100GB): set cache_dir to an existing Unity Catalog volume path (uc_volume_path). Alternatively, if you need to optimize for performance, you can set cache_dir to the local disk on your cluster (which is where the elastic disk is mounted), and then copy the dataset to Unity Catalog for persistence.
  • Large (~TB or more): set cache_dir to an existing Unity Catalog volume path (uc_volume_path).

Requirements:

  • Databricks Runtime 13.0 ML or above.
  • A workspace with Unity Catalog enabled. You also need to have the following permissions in order to write data to a Unity Catalog volume:
    • The WRITE VOLUME privilege on the volume you want to upload files to.
    • The USE SCHEMA privilege on the parent schema.
    • The USE CATALOG privilege on the parent catalog.
  • Significant compute resources for downloading large datasets. The large dataset used in this notebook takes more than a day to download.

Unity Catalog setup

To learn more about how to use Unity Catalog, see (AWS | Azure | GCP).

The following cell sets the catalog, schema, and volume that your datasets are written to. You must have the USE CATALOG privilege on the catalog, the USE SCHEMA privilege on the schema, and the WRITE VOLUME privilege on the volume you want to upload files to. Change the catalog, schema, and volume names in the following cell if necessary.

If you don't have an existing Unity Catalog volume that you can write to, create a new one. See (AWS | Azure | GCP).
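A minimal sketch of such a setup cell, assuming the catalog main, schema default, and volume my-volume that appear later in this notebook; change these to locations you can write to:

    # Unity Catalog locations used throughout this notebook; adjust as needed.
    catalog = "main"
    schema = "default"
    volume = "my-volume"

    # All dataset caches in this notebook are placed under this volume path.
    uc_volume_path = f"/Volumes/{catalog}/{schema}/{volume}"

    # Optional: create the volume if it does not exist yet (requires the CREATE VOLUME privilege).
    # spark.sql(f"CREATE VOLUME IF NOT EXISTS {catalog}.{schema}.`{volume}`")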


Know your dataset size

Before downloading a dataset from Hugging Face, you can determine the required disk space if the uploader provided the sizes on the Hugging Face Hub.
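A sketch of how to read those sizes without downloading the data, using datasets.load_dataset_builder. The print_dataset_size helper and its byte formatting below are illustrative assumptions, not the notebook's original code:

    import datasets

    def _fmt(num_bytes):
        # Rough human-readable size string.
        for unit in ["", "K", "M", "G", "T"]:
            if num_bytes < 1000:
                return f"{num_bytes:.1f}{unit}"
            num_bytes /= 1000
        return f"{num_bytes:.1f}P"

    def print_dataset_size(*args):
        # load_dataset_builder fetches only the metadata/loading script, not the data.
        info = datasets.load_dataset_builder(*args).info
        if info.download_size is None or info.dataset_size is None:
            print(f"Dataset size for {args[0]} is not provided by uploader")
        else:
            print(f"{args}: download_size={_fmt(info.download_size)}, "
                  f"dataset_size={_fmt(info.dataset_size)}")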


Examples where dataset size is provided in datasets.DatasetBuilder.info
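For example, for imdb (using the hypothetical helper above):

    print_dataset_size('imdb')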


    ('imdb',): download_size=79.6M, dataset_size=127.0M

Some datasets require the user to specify the configuration name to load them.
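For oscar, for instance, a configuration name such as unshuffled_deduplicated_en is passed along with the dataset name (again using the hypothetical helper):

    print_dataset_size('oscar', 'unshuffled_deduplicated_en')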


    /databricks/python/lib/python3.11/site-packages/datasets/load.py:1486: FutureWarning: The repository for oscar contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/oscar You can avoid this message in future by passing the argument `trust_remote_code=True`. Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
      warnings.warn(
    ('oscar', 'unshuffled_deduplicated_en'): download_size=462.4G, dataset_size=1.2T

Examples where dataset size is not provided
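The remaining examples are queried the same way, for instance (using the hypothetical helper from above):

    print_dataset_size('tatsu-lab/alpaca')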


    ('tatsu-lab/alpaca',): download_size=23.1M, dataset_size=42.1M
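And for mc4; the 'af' configuration here is an illustrative assumption, since the original cell's arguments are not preserved:

    print_dataset_size('mc4', 'af')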

    /databricks/python/lib/python3.11/site-packages/datasets/load.py:1486: FutureWarning: The repository for mc4 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/mc4 You can avoid this message in future by passing the argument `trust_remote_code=True`. Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
      warnings.warn(
    Dataset size for mc4 is not provided by uploader
    /root/.cache/huggingface/modules/datasets_modules/datasets/mc4/78f7a2b7e2524fa44ee464ef429d011c365f5fe129283869e7fd76856aacb83a/mc4.py:284: FutureWarning: Dataset 'mc4' is deprecated and will be deleted. Use 'allenai/c4' instead.
      warnings.warn(

When the dataset size is not provided in datasets.DatasetBuilder.info, there are other ways to estimate it. For example, you can query the Hugging Face Datasets Server API for a sample of rows:

    curl -X GET \
      "https://datasets-server.huggingface.co/first-rows?dataset=mc4&config=af&split=train"

Check available storage space

With the df -h command, you can check the available disk space.
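In a notebook this is typically run as a shell command; a sketch using Python's subprocess module (the original cell may have used %sh df -h instead):

    import subprocess

    # Show mounted file systems and their free space on the driver node.
    print(subprocess.run(["df", "-h"], capture_output=True, text=True).stdout)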


    Size  Used  Avail  Use%  Mounted on
    147G  30G   110G   22%   /
    492K  4.0K  488K   1%    /dev
    147G  30G   110G   22%   /mnt/readonly
    206G  11G   185G   6%    /local_disk0
    29G   20G   9.7G   67%   /ttyd
    16G   4.0K  16G    1%    /dev/shm
    6.2G  92K   6.2G   1%    /run
    5.0M  0     5.0M   0%    /run/lock
    4.0M  0     4.0M   0%    /sys/fs/cgroup
    10G   0     10G    0%    /Workspace
    1.0P  0     1.0P   0%    /Volumes
    1.0P  0     1.0P   0%    /dbfs

From the output, there are 3 main file systems to note:

  • / (ephemeral file system at instance root)
    • This is the default file system if a file is not stored elsewhere, such as on the local disk or in Unity Catalog.
    • Example: /root and /tmp are both under /.
    • Caveats:
      • The disk at instance root is ephemeral: if the cluster is terminated, the data is gone.
      • The disk space is not autoscaled when it runs low.
  • Local disk on your cluster (ephemeral elastic disk)
    • If "Enable autoscaling local storage" is selected in the cluster configuration page, the local disk on your cluster can autoscale when the available space runs low. This is enabled on Azure by default. For AWS, see the documentation on cluster creation for more details.
    • Example: If you create a directory here and start downloading a huge dataset into it, additional disk space is attached as the disk approaches capacity.
    • Caveats:
      • The elastic disk is also ephemeral.
      • Although disk autoscaling is supported, it can scale to at most 5TB.
  • /Volumes (Unity Catalog volume)
    • Unity Catalog is a unified governance solution for data and AI assets on Databricks. You can use Unity Catalog volumes to store and access files in any format, including structured, semi-structured, and unstructured data. Saving your dataset to a volume lets you persist your data in a governed way in Databricks. Volumes can also be shared with other users and accessed across clusters. /Volumes is backed by object storage.
    • Example: If you create a volume with the path /Volumes/main/default/cache and save the dataset to this location, you can access it even after the cluster is terminated, or from another cluster. Other users who have access to the volume can also access the dataset.
    • Caveat: The speed of reading from Unity Catalog (such as loading saved datasets or models) can be slower than the previous two options because of network overhead.

Small-medium datasets (~100GB)

Point cache_dir to a Unity Catalog volume path (starts with /Volumes)

Databricks recommends saving data to Unity Catalog by default, as it provides a unified data governance solution. You can do so by setting cache_dir to {uc_volume_path}/hf_imdb_cache.

Using Unity Catalog allows you to persist the dataset for cross-cluster access or future access after this cluster gets terminated.
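A sketch of that download, assuming the uc_volume_path defined in the setup cell above:

    from datasets import load_dataset

    # Download and prepare imdb directly into the Unity Catalog volume so it persists.
    imdb = load_dataset("imdb", cache_dir=f"{uc_volume_path}/hf_imdb_cache")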


(Optional) Point cache_dir to elastic local disk

If you need to optimize for performance, you can try saving your datasets to the elastic local disk instead of Unity Catalog. This option only works for datasets that can fit in the initial available space of the elastic local disk.

In this example, imdb takes only ~200MB in total, so it easily fits on the elastic disk. From the previous df -h output, you know that the elastic disk is mounted at /local_disk0, so you can set LOCAL_DISK_MOUNT to that path.

If you use the local disk but still need to persist the data, you can then copy the dataset from the local disk to a Unity Catalog volume (set persistent_path to {uc_volume_path}/hf_imdb_cache), as shown in the sketch below.
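The original cells are not preserved here; a sketch of the overall flow, using the /local_disk0/hf_cache and {uc_volume_path}/hf_imdb_cache paths referenced in the surrounding text and output:

    import shutil
    from datasets import load_dataset

    # Elastic local disk mount seen in the df -h output above.
    LOCAL_DISK_MOUNT = "/local_disk0"
    local_cache = f"{LOCAL_DISK_MOUNT}/hf_cache"

    # Download and prepare the dataset on the faster, but ephemeral, local disk.
    imdb = load_dataset("imdb", cache_dir=local_cache)

    # Copy the cache to a Unity Catalog volume so the dataset survives cluster termination.
    persistent_path = f"{uc_volume_path}/hf_imdb_cache"
    shutil.copytree(local_cache, persistent_path, dirs_exist_ok=True)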


    /databricks/python_shell/dbruntime/huggingface_patches/datasets.py:45: UserWarning: The cache_dir for this dataset is /local_disk0/hf_cache/, which is not a persistent path. Therefore, if/when the cluster restarts, the downloaded dataset will be lost. The persistent storage options for this workspace/cluster config are: [DBFS, UC Volumes]. Please update either `cache_dir` or the environment variable `HF_DATASETS_CACHE` to be under one of the following root directories: ['/dbfs/', '/Volumes/']
      warnings.warn(warning_message)

    '/Volumes/main/default/my-volume/hf_imdb_cache'

Large datasets (~TB): Unity Catalog volume

Point cache_dir to a Unity Catalog volume path (starts with /Volumes)

Since the elastic disk can scale to at most 5TB, and autoscaling is not guaranteed to attach new disk space as fast as the download consumes it, Databricks recommends saving the dataset to Unity Catalog, which has far more capacity.

In this example, the oscar subset unshuffled_deduplicated_en is expected to occupy ~1.7TB, which could fit on the elastic disk once autoscaling takes effect; but in case autoscaling does not keep up or the cluster crashes partway through, you can set cache_dir to point to a Unity Catalog volume path ({uc_volume_path}/hf_oscar_cache).

Warning: Significant compute time expected

The next code block loads a very large dataset into Unity Catalog and is expected to take over a day to run, which costs a significant amount of compute time and resources. If you want to save your compute resources, feel free to skip this step.
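A sketch of that cell, assuming the hf_oscar_cache subdirectory mentioned above:

    from datasets import load_dataset

    # Download ~1.7TB of data into the Unity Catalog volume; expect this to run for over a day.
    oscar = load_dataset(
        "oscar",
        "unshuffled_deduplicated_en",
        cache_dir=f"{uc_volume_path}/hf_oscar_cache",
    )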


    /databricks/python/lib/python3.11/site-packages/datasets/load.py:1486: FutureWarning: The repository for oscar contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/oscar You can avoid this message in future by passing the argument `trust_remote_code=True`. Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
      warnings.warn(
    /databricks/python_shell/dbruntime/huggingface_patches/datasets.py:127: UserWarning: The dataset would be saved to both local disk and PersistentStorageType.VOLUMES for better performance.
      warnings.warn(

Use streaming instead of downloading the dataset

When the dataset is very large, it can take a long time to download and can occupy a lot of disk space. In this case, you can also consider not downloading the entire dataset and using dataset streaming instead.
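A sketch of loading the same dataset as a stream; note that the data itself is not downloaded up front:

    from datasets import load_dataset

    # streaming=True returns lazily evaluated splits instead of downloading them.
    oscar_streamed = load_dataset("oscar", "unshuffled_deduplicated_en", streaming=True)
    print(type(oscar_streamed))  # datasets.dataset_dict.IterableDatasetDict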


    /databricks/python_shell/dbruntime/huggingface_patches/datasets.py:45: UserWarning: The cache_dir for this dataset is /root/.cache, which is not a persistent path. Therefore, if/when the cluster restarts, the downloaded dataset will be lost. The persistent storage options for this workspace/cluster config are: [DBFS, UC Volumes]. Please update either `cache_dir` or the environment variable `HF_DATASETS_CACHE` to be under one of the following root directories: ['/dbfs/', '/Volumes/']
      warnings.warn(warning_message)
    /databricks/python/lib/python3.11/site-packages/datasets/load.py:1486: FutureWarning: The repository for oscar contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/oscar You can avoid this message in future by passing the argument `trust_remote_code=True`. Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
      warnings.warn(
    /databricks/python_shell/dbruntime/huggingface_patches/datasets.py:99: UserWarning: This dataset will be stored in /root/.cache, which has a limited available space of 109.8GB, while the required size is 1.6TB. Set `cache_dir` or the environment variable `HF_DATASETS_CACHE` to be either under `/local_disk0/` to use elastic local disk or one of the available persistent storage options: [DBFS, UC Volumes].
      warnings.warn(
    <class 'datasets.dataset_dict.IterableDatasetDict'>

As you can see, a dataset loaded with streaming=True is an IterableDatasetDict, whereas downloading it directly yields a DatasetDict.

For instructions on how to use a dataset that is streamed rather than downloaded, see the Hugging Face tutorials on dataset streaming and on the differences between Dataset and IterableDataset.