hugging-face-dataset-download (Python)


Best practice for downloading datasets from Hugging Face to Databricks

This guide provides recommended best practices for using the Hugging Face load_dataset function to download and prepare datasets on Databricks for different data sizes.

  • Small-Medium (~100GB): set cache_dir to an existing Unity Catalog volume path (uc_volume_path). Alternatively, if you need to optimize for performance, you can set cache_dir to the local disk on your cluster (which is where the elastic disk is mounted), and then copy the dataset to Unity Catalog for persistence.
  • Large (~TB or more): set cache_dir to an existing Unity Catalog volume path (uc_volume_path).

Requirements:

  • Databricks Runtime 13.0 ML or above.
  • A workspace with Unity Catalog enabled. You also need to have the following permissions in order to write data to a Unity Catalog volume:
    • The WRITE VOLUME privilege on the volume you want to upload files to.
    • The USE SCHEMA privilege on the parent schema.
    • The USE CATALOG privilege on the parent catalog.
  • Significant compute resources for downloading large datasets. The large dataset used in this notebook takes more than a day to download.

Unity Catalog setup

To learn more about how to use Unity Catalog, see (AWS | Azure | GCP).

The following cell sets the catalog, schema, and volume that your datasets are written to. You must have the USE CATALOG privilege on the catalog, the USE SCHEMA privilege on the schema, and the WRITE VOLUME privilege on the volume you want to upload files to. Change the catalog, schema, and volume names in the following cell if necessary.

If you don't have an existing Unity Catalog volume that you can write to, create a new one. See (AWS | Azure | GCP).
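A minimal sketch of such a setup cell, assuming the catalog main, schema default, and volume my-volume that appear later in this notebook; change these to locations you can write to:

    # Unity Catalog locations used throughout this notebook; adjust as needed.
    catalog = "main"
    schema = "default"
    volume = "my-volume"

    # All dataset caches in this notebook are placed under this volume path.
    uc_volume_path = f"/Volumes/{catalog}/{schema}/{volume}"

    # Optional: create the volume if it does not exist yet (requires the CREATE VOLUME privilege).
    # spark.sql(f"CREATE VOLUME IF NOT EXISTS {catalog}.{schema}.`{volume}`")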


Know your dataset size

Before downloading a dataset from Hugging Face, you can determine the required disk space if the uploader provided the sizes on the Hugging Face Hub.
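A sketch of how to read those sizes without downloading the data, using datasets.load_dataset_builder. The print_dataset_size helper and its byte formatting below are illustrative assumptions, not the notebook's original code:

    import datasets

    def _fmt(num_bytes):
        # Rough human-readable size string.
        for unit in ["", "K", "M", "G", "T"]:
            if num_bytes < 1000:
                return f"{num_bytes:.1f}{unit}"
            num_bytes /= 1000
        return f"{num_bytes:.1f}P"

    def print_dataset_size(*args):
        # load_dataset_builder fetches only the metadata/loading script, not the data.
        info = datasets.load_dataset_builder(*args).info
        if info.download_size is None or info.dataset_size is None:
            print(f"Dataset size for {args[0]} is not provided by uploader")
        else:
            print(f"{args}: download_size={_fmt(info.download_size)}, "
                  f"dataset_size={_fmt(info.dataset_size)}")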


Examples where dataset size is provided in datasets.DatasetBuilder.info
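For example, for imdb (using the hypothetical helper above):

    print_dataset_size('imdb')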


    ('imdb',): download_size=79.6M, dataset_size=127.0M

Some datasets require the user to specify the configuration name to load them.
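For oscar, for instance, a configuration name such as unshuffled_deduplicated_en is passed along with the dataset name (again using the hypothetical helper):

    print_dataset_size('oscar', 'unshuffled_deduplicated_en')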


    /databricks/python/lib/python3.11/site-packages/datasets/load.py:1486: FutureWarning: The repository for oscar contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/oscar You can avoid this message in future by passing the argument `trust_remote_code=True`. Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
      warnings.warn(
    ('oscar', 'unshuffled_deduplicated_en'): download_size=462.4G, dataset_size=1.2T

Examples where dataset size is not provided
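The remaining examples are queried the same way, for instance (using the hypothetical helper from above):

    print_dataset_size('tatsu-lab/alpaca')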


    ('tatsu-lab/alpaca',): download_size=23.1M, dataset_size=42.1M
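And for mc4; the 'af' configuration here is an illustrative assumption, since the original cell's arguments are not preserved:

    print_dataset_size('mc4', 'af')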

    /databricks/python/lib/python3.11/site-packages/datasets/load.py:1486: FutureWarning: The repository for mc4 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/mc4 You can avoid this message in future by passing the argument `trust_remote_code=True`. Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
      warnings.warn(
    Dataset size for mc4 is not provided by uploader
    /root/.cache/huggingface/modules/datasets_modules/datasets/mc4/78f7a2b7e2524fa44ee464ef429d011c365f5fe129283869e7fd76856aacb83a/mc4.py:284: FutureWarning: Dataset 'mc4' is deprecated and will be deleted. Use 'allenai/c4' instead.
      warnings.warn(

When the dataset size is not provided in datasets.DatasetBuilder.info, there are other ways to estimate it. For example, you can query the Hugging Face Datasets Server API for a sample of rows:

    curl -X GET \
      "https://datasets-server.huggingface.co/first-rows?dataset=mc4&config=af&split=train"

Check available storage space

With the df -h command, you can check the available disk space.
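In a notebook this is typically run as a shell command; a sketch using Python's subprocess module (the original cell may have used %sh df -h instead):

    import subprocess

    # Show mounted file systems and their free space on the driver node.
    print(subprocess.run(["df", "-h"], capture_output=True, text=True).stdout)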


    Size  Used  Avail  Use%  Mounted on
    147G  30G   110G   22%   /
    492K  4.0K  488K   1%    /dev
    147G  30G   110G   22%   /mnt/readonly
    206G  11G   185G   6%    /local_disk0
    29G   20G   9.7G   67%   /ttyd
    16G   4.0K  16G    1%    /dev/shm
    6.2G  92K   6.2G   1%    /run
    5.0M  0     5.0M   0%    /run/lock
    4.0M  0     4.0M   0%    /sys/fs/cgroup
    10G   0     10G    0%    /Workspace
    1.0P  0     1.0P   0%    /Volumes
    1.0P  0     1.0P   0%    /dbfs

From the output, there are 3 main file systems to note:

  • / (ephemeral file system at instance root)
    • This is the default file system if a file is not stored elsewhere, such as on the local disk or in Unity Catalog.
    • Example: /root and /tmp are both under /.
    • Caveats:
      • The disk at instance root is ephemeral: if the cluster is terminated, the data is gone.
      • The disk space is not autoscaled when it runs low.
  • Local disk on your cluster (ephemeral elastic disk)
    • If "Enable autoscaling local storage" is selected in the cluster configuration page, the local disk on your cluster can autoscale when the available space runs low. This is enabled on Azure by default. For AWS, see the documentation on cluster creation for more details.
    • Example: If you create a directory here and start downloading a huge dataset into it, additional disk space is attached as the disk approaches capacity.
    • Caveats:
      • The elastic disk is also ephemeral.
      • Although disk autoscaling is supported, it can scale to at most 5TB.
  • /Volumes (Unity Catalog volume)
    • Unity Catalog is a unified governance solution for data and AI assets on Databricks. You can use Unity Catalog volumes to store and access files in any format, including structured, semi-structured, and unstructured data. Saving your dataset to a volume lets you persist your data in a governed way in Databricks. Volumes can also be shared with other users and accessed across clusters. /Volumes is backed by object storage.
    • Example: If you create a volume with the path /Volumes/main/default/cache and save the dataset to this location, you can access it even after the cluster is terminated, or from another cluster. Other users who have access to the volume can also access the dataset.
    • Caveat: The speed of reading from Unity Catalog (such as loading saved datasets or models) can be slower than the previous two options because of network overhead.

Small-medium datasets (~100GB)

Point cache_dir to a Unity Catalog volume path (starts with /Volumes)

Databricks recommends saving data to Unity Catalog by default, as it provides a unified data governance solution. You can do so by setting cache_dir to {uc_volume_path}/hf_imdb_cache.

Using Unity Catalog allows you to persist the dataset for cross-cluster access or future access after this cluster gets terminated.
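A sketch of that download, assuming the uc_volume_path defined in the setup cell above:

    from datasets import load_dataset

    # Download and prepare imdb directly into the Unity Catalog volume so it persists.
    imdb = load_dataset("imdb", cache_dir=f"{uc_volume_path}/hf_imdb_cache")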


(Optional) Point cache_dir to elastic local disk

If you need to optimize for performance, you can try saving your datasets to the elastic local disk instead of Unity Catalog. This option only works for datasets that can fit in the initial available space of the elastic local disk.

In this example, imdb takes only ~200MB in total, so it easily fits on the elastic disk. From the previous df -h output, you know that the elastic disk is mounted at /local_disk0, so you can set LOCAL_DISK_MOUNT to that path.

If you use the local disk but still need to persist the data, you can then copy the dataset from the local disk to a Unity Catalog volume (set persistent_path to {uc_volume_path}/hf_imdb_cache), as shown in the sketch below.
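The original cells are not preserved here; a sketch of the overall flow, using the /local_disk0/hf_cache and {uc_volume_path}/hf_imdb_cache paths referenced in the surrounding text and output:

    import shutil
    from datasets import load_dataset

    # Elastic local disk mount seen in the df -h output above.
    LOCAL_DISK_MOUNT = "/local_disk0"
    local_cache = f"{LOCAL_DISK_MOUNT}/hf_cache"

    # Download and prepare the dataset on the faster, but ephemeral, local disk.
    imdb = load_dataset("imdb", cache_dir=local_cache)

    # Copy the cache to a Unity Catalog volume so the dataset survives cluster termination.
    persistent_path = f"{uc_volume_path}/hf_imdb_cache"
    shutil.copytree(local_cache, persistent_path, dirs_exist_ok=True)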


    /databricks/python_shell/dbruntime/huggingface_patches/datasets.py:45: UserWarning: The cache_dir for this dataset is /local_disk0/hf_cache/, which is not a persistent path. Therefore, if/when the cluster restarts, the downloaded dataset will be lost. The persistent storage options for this workspace/cluster config are: [DBFS, UC Volumes]. Please update either `cache_dir` or the environment variable `HF_DATASETS_CACHE` to be under one of the following root directories: ['/dbfs/', '/Volumes/']
      warnings.warn(warning_message)

    '/Volumes/main/default/my-volume/hf_imdb_cache'

Large datasets (~TB): Unity Catalog volume

Point cache_dir to a Unity Catalog volume path (starts with /Volumes)

Since the elastic disk can scale to at most 5TB, and autoscaling is not guaranteed to attach new disk space as fast as the download consumes it, Databricks recommends saving the dataset to Unity Catalog, which has far more capacity.

In this example, the oscar subset unshuffled_deduplicated_en is expected to occupy ~1.7TB, which could fit on the elastic disk once autoscaling takes effect; but in case autoscaling does not keep up or the cluster crashes partway through, you can set cache_dir to point to a Unity Catalog volume path ({uc_volume_path}/hf_oscar_cache).

Warning: Significant compute time expected

The next code block loads a very large dataset into Unity Catalog and is expected to take over a day to run, which costs a significant amount of compute time and resources. If you want to save your compute resources, feel free to skip this step.
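A sketch of that cell, assuming the hf_oscar_cache subdirectory mentioned above:

    from datasets import load_dataset

    # Download ~1.7TB of data into the Unity Catalog volume; expect this to run for over a day.
    oscar = load_dataset(
        "oscar",
        "unshuffled_deduplicated_en",
        cache_dir=f"{uc_volume_path}/hf_oscar_cache",
    )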


    /databricks/python/lib/python3.11/site-packages/datasets/load.py:1486: FutureWarning: The repository for oscar contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/oscar You can avoid this message in future by passing the argument `trust_remote_code=True`. Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
      warnings.warn(
    /databricks/python_shell/dbruntime/huggingface_patches/datasets.py:127: UserWarning: The dataset would be saved to both local disk and PersistentStorageType.VOLUMES for better performance.
      warnings.warn(

Use streaming instead of downloading the dataset

When the dataset is very large, it can take a long time to download and can occupy a lot of disk space. In this case, you can also consider not downloading the entire dataset and using dataset streaming instead.
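A sketch of loading the same dataset as a stream; note that the data itself is not downloaded up front:

    from datasets import load_dataset

    # streaming=True returns lazily evaluated splits instead of downloading them.
    oscar_streamed = load_dataset("oscar", "unshuffled_deduplicated_en", streaming=True)
    print(type(oscar_streamed))  # datasets.dataset_dict.IterableDatasetDict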


    /databricks/python_shell/dbruntime/huggingface_patches/datasets.py:45: UserWarning: The cache_dir for this dataset is /root/.cache, which is not a persistent path. Therefore, if/when the cluster restarts, the downloaded dataset will be lost. The persistent storage options for this workspace/cluster config are: [DBFS, UC Volumes]. Please update either `cache_dir` or the environment variable `HF_DATASETS_CACHE` to be under one of the following root directories: ['/dbfs/', '/Volumes/']
      warnings.warn(warning_message)
    /databricks/python/lib/python3.11/site-packages/datasets/load.py:1486: FutureWarning: The repository for oscar contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/oscar You can avoid this message in future by passing the argument `trust_remote_code=True`. Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
      warnings.warn(
    /databricks/python_shell/dbruntime/huggingface_patches/datasets.py:99: UserWarning: This dataset will be stored in /root/.cache, which has a limited available space of 109.8GB, while the required size is 1.6TB. Set `cache_dir` or the environment variable `HF_DATASETS_CACHE` to be either under `/local_disk0/` to use elastic local disk or one of the available persistent storage options: [DBFS, UC Volumes].
      warnings.warn(
    <class 'datasets.dataset_dict.IterableDatasetDict'>

As you can see, a dataset loaded with streaming=True is an IterableDatasetDict, whereas downloading it directly yields a DatasetDict.

For instructions on how to use a dataset that is streamed rather than downloaded, see the Hugging Face tutorials on dataset streaming and on the differences between Dataset and IterableDataset.