Validate data for fine-tuning runs
This notebook shows how to validate your data and ensure data integrity is upheld while using Mosaic AI Model Training. It also provides guidance on how to estimate costs based on token usage during fine-tuning runs.
This script serves as an ad-hoc utility for you to run independently prior to starting fine-tuning. Its primary function is to validate your data before you invoke the Finetuning API. This script is not meant for use during the training process.
The inputs to this validation script are assumed to be the same as, or a subset of, the inputs accepted by the Mosaic AI Model Training API, such as the following:
Install libraries
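The exact dependency set can vary by environment; the following is a minimal sketch that assumes `llm-foundry` and `datasets` cover the validation steps in this notebook.

```python
# Assumed package set for this validation notebook; pin versions to match
# your workspace requirements.
%pip install llm-foundry datasets

# Restart Python so newly installed packages are picked up (Databricks notebooks).
dbutils.library.restartPython()
```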
Instruction fine-tuning
In this section, you set up the parameters for the validation notebook.
The following are the API arguments for fine-tuning tasks:
| Argument | Description |
|---|---|
| `model` | Specifies the model to be used for fine-tuning. For example, `EleutherAI/gpt-neox-20b`. |
| `train_data_path` | The path to the training data. It can be a Hugging Face dataset, a path to a JSON Lines (`.jsonl`) file, or a Delta table. |
| `task_type` | Defines the type of task for which the training strategy is applied. It is either `INSTRUCTION_FINETUNE` or `CONTINUED_PRETRAIN`. |
| `training_duration` | The duration of the training process, expressed numerically in units of training epochs. |
| `context_length` | Specifies the context length of the model, set to 2048. This determines how many tokens the model considers for each training example. |
The following are temporary data path configuration arguments:
- `temporary_jsonl_data_path`: Defines a file system path where temporary data related to the training process is stored. You need to make sure the path is not shared by other users on the cluster, because sharing could cause problems.
- Environment variables for Hugging Face caches (`HF_DATASETS_CACHE`) are set to `/tmp/`, directing dataset caching to a temporary directory.
You need to specify the context length based on the model. For the latest supported models and their associated context lengths, use the `get_models()` function. See the supported models (AWS | Azure).
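A minimal sketch of listing the supported models. The import path below is an assumption (it presumes the `databricks_genai` SDK exposes `get_models()` under `databricks.model_training.foundation_model`); verify the exact module path for your installed SDK version.

```python
# Assumption: the foundation model training SDK (databricks_genai) is installed
# and exposes get_models(); verify the import path for your SDK version.
from databricks.model_training import foundation_model as fm

# Lists the supported base models and their associated context lengths.
models = fm.get_models()
print(models)
```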
The following sets up your `home` directory:
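A minimal sketch, assuming a Databricks notebook where `spark` is available; the convention for deriving a per-user path is illustrative.

```python
import os

# Derive a per-user home directory so that temporary paths used later in this
# notebook are not shared with other users on the cluster.
username = spark.sql("SELECT current_user()").first()[0]
home = os.path.join("/tmp", username)
os.makedirs(home, exist_ok=True)
print(f"Using home directory: {home}")
```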
The following defines the fine-tuning API arguments and uses `temporary_jsonl_data_path` to define the file system path where you store temporary data related to the training process. Environment variables for Hugging Face caches (`HF_DATASETS_CACHE`) are set to `/tmp/`, which directs dataset caching to a temporary directory.
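A sketch of the instruction fine-tuning arguments described above. The model name, training data path, and temporary path are placeholders; substitute your own values.

```python
import os
from argparse import Namespace

# Placeholder path; `home` comes from the setup cell above. Keep it
# user-specific so it is not shared with other users on the cluster.
temporary_jsonl_data_path = os.path.join(home, "ift_data")

FT_API_args = Namespace(
    model="EleutherAI/gpt-neox-20b",
    train_data_path="main.default.my_training_table",  # Delta table, .jsonl path, or HF dataset ID (placeholder)
    task_type="INSTRUCTION_FINETUNE",
    training_duration=3,       # number of training epochs
    context_length=2048,
)

# Direct Hugging Face dataset caching to a temporary directory.
os.environ["HF_DATASETS_CACHE"] = "/tmp/"
os.makedirs(temporary_jsonl_data_path, exist_ok=True)
```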
Data loading
The instruction fine-tuning data needs to have the dictionary format below:

- `prompt`: xxx
- `response` or `completion`: yyy
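For example, each line of a `.jsonl` training file corresponds to one such dictionary (values are illustrative):

```python
# One training example per line in the .jsonl file; values are illustrative.
example = {
    "prompt": "What is Mosaic AI Model Training?",
    "response": "A managed service for fine-tuning foundation models on Databricks.",
}
```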
Based on `FT_API_args.train_data_path`, select an ingestion method from one of the following options (a sketch follows the list):

- A JSONL file stored in an object store supported by Composer.
- A Hugging Face dataset ID. For this option, you also need to provide a split.
- A Delta table.
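The following sketch shows one way to branch on `FT_API_args.train_data_path`. The helper name and the rule for detecting a Delta table (a three-level `catalog.schema.table` name) are illustrative, not the notebook's exact logic.

```python
import json
from datasets import load_dataset

def load_raw_dataset(FT_API_args, hf_split="train"):
    """Illustrative loader covering the three ingestion options above."""
    path = FT_API_args.train_data_path
    if path.endswith(".jsonl"):
        # JSONL file (a local copy or a locally mounted object store path).
        with open(path) as f:
            return [json.loads(line) for line in f]
    elif len(path.split(".")) == 3:
        # Assume a three-level namespace is a Delta table: catalog.schema.table.
        # Collecting to the driver is acceptable for validation-sized data.
        return [row.asDict() for row in spark.table(path).collect()]
    else:
        # Otherwise treat it as a Hugging Face dataset ID; a split must be provided.
        return list(load_dataset(path, split=hf_split))

raw_dataset = load_raw_dataset(FT_API_args)
```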
Data quality checks on the dataset
This section of the notebook performs a series of checks on the initial dataset to ensure its quality and expected format. The checks verify that the dataset adheres to the expected structure and contains the necessary keys for further processing, as outlined below.
- The total number of examples in the dataset is printed.
- The first example from the dataset is displayed. This provides a quick glimpse into the data structure and format.
- Data format validation:
- The dataset is expected to consist of dictionary-like objects (for example, key-value pairs).
- A check is performed to validate this structure.
- Key presence validation:
- The allowed prompt keys, response keys, and chat roles are defined in llmfoundry as `_ALLOWED_PROMPT_KEYS`, `_ALLOWED_RESPONSE_KEYS`, and `_ALLOWED_ROLES`.
- For a prompt-response dataset, the script checks for the presence of at least one prompt key and one response key in each example.
- Prompt validation: Each example is checked for the presence of keys defined in `_ALLOWED_PROMPT_KEYS`. If no valid prompt key is found, it is counted as a format error.
- Response validation: Similarly, each example is checked for the presence of keys defined in `_ALLOWED_RESPONSE_KEYS`. The absence of a valid response key is also counted as a format error.
- For a chat-formatted dataset, the script checks whether the message content is valid by calling the `_validate_chat_formatted_example` helper function.
If any format errors are found during the checks, they are reported. A summary of errors is printed, categorizing them into types like `data_type` (non-dictionary data), `missing_prompt`, and `missing_response`.

If no errors are found, a congratulatory message is displayed, indicating that all checks have passed successfully.
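A minimal sketch of the prompt-response checks. The key sets below are defined locally to mirror llmfoundry's `_ALLOWED_PROMPT_KEYS` and `_ALLOWED_RESPONSE_KEYS`; in practice, import the constants from llmfoundry so they stay in sync.

```python
from collections import defaultdict

# Local stand-ins for llmfoundry's _ALLOWED_PROMPT_KEYS / _ALLOWED_RESPONSE_KEYS.
ALLOWED_PROMPT_KEYS = {"prompt"}
ALLOWED_RESPONSE_KEYS = {"response", "completion"}

print(f"Num examples: {len(raw_dataset)}")
print("First example:", raw_dataset[0])

format_errors = defaultdict(int)
for example in raw_dataset:
    if not isinstance(example, dict):
        format_errors["data_type"] += 1         # non-dictionary data
        continue
    if not any(k in example for k in ALLOWED_PROMPT_KEYS):
        format_errors["missing_prompt"] += 1    # no valid prompt key
    if not any(k in example for k in ALLOWED_RESPONSE_KEYS):
        format_errors["missing_response"] += 1  # no valid response key

if format_errors:
    print("Found errors:")
    for error_type, count in format_errors.items():
        print(f"  {error_type}: {count}")
else:
    print("Congratulations! All checks passed.")
```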
Token estimation
Tokenize the raw dataset and gather statistics on the tokens. By doing this, you can estimate the overall cost based on a default training run duration. You iterate over the dataloader and sum the number of tokens from each batch.
The fine-tuning API internally ingests the dataset and runs tokenization with the selected tokenizer. The output dataset is a collection of samples and each sample is a collection of token IDs represented as integers.
A histogram is generated so you can visualize the distribution of the frequency of token counts in samples in the dataset. The visualization aids in identifying patterns, outliers, and central tendencies in the token distribution.
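A rough sketch of the token estimate and histogram. The fine-tuning API's own tokenization (prompt templating, packing, padding) can differ from this, so treat the numbers as an approximation.

```python
import matplotlib.pyplot as plt
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(FT_API_args.model)

token_counts = []
for example in raw_dataset:
    text = example["prompt"] + example.get("response", example.get("completion", ""))
    n_tokens = len(tokenizer(text)["input_ids"])
    # Samples longer than the context length are truncated during training.
    token_counts.append(min(n_tokens, FT_API_args.context_length))

tokens_per_epoch = sum(token_counts)
print(f"Tokens per epoch: {tokens_per_epoch}")
print(f"Estimated tokens for the run: {tokens_per_epoch * FT_API_args.training_duration}")

# Histogram of token counts per sample to spot outliers and central tendencies.
plt.hist(token_counts, bins=50)
plt.xlabel("Tokens per sample")
plt.ylabel("Frequency")
plt.title("Distribution of token counts")
plt.show()
```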
Continued pretraining
Similar to instruction fine-tuning, you need to specify the following arguments:
| Argument | Description |
|---|---|
| `model` | Specifies the model to be used for fine-tuning. For example, `EleutherAI/gpt-neox-20b`. |
| `train_data_path` | The path to the training data. Currently, only a remote or local path to a collection of `.txt` files is supported. |
| `task_type` | Defines the type of task for which the training strategy is applied. It is either `INSTRUCTION_FINETUNE` or `CONTINUED_PRETRAIN`. |
| `training_duration` | The duration of the training process, expressed numerically in units of training epochs. |
| `context_length` | Specifies the context length of the model, set to 2048. This determines how many tokens the model considers for each training example. For continued pretraining, tokens are concatenated to form samples of length equal to `context_length`. |
The following are temporary data path configuration arguments:
- `temporary_mds_output_path`: Defines a file system path where temporary data generated by the running notebook is stored. You need to make sure the path isn't shared by other users on the cluster, because sharing could cause problems. For example, you can make it distinguishable by adding your username to the path.
The following defines the fine-tuning API arguments and uses `temporary_mds_output_path` to define the file system path where temporary data related to the training process is stored.
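A sketch of the continued pretraining arguments; the paths are placeholders, and deriving the MDS output path from the per-user `home` directory (from the setup cell earlier) keeps it distinct per user.

```python
import os
from argparse import Namespace

# Placeholder output path; `home` comes from the setup cell earlier in the notebook.
temporary_mds_output_path = os.path.join(home, "mds_output")

FT_API_args = Namespace(
    model="EleutherAI/gpt-neox-20b",
    train_data_path="/Volumes/main/default/my_volume/cpt_txt_files",  # folder of .txt files (placeholder)
    task_type="CONTINUED_PRETRAIN",
    training_duration=3,       # number of training epochs
    context_length=2048,
)
```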
Generate a synthetic dataset. In practice, replace `train_data_path` with your raw data path.
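A sketch that writes a handful of synthetic `.txt` files so the rest of the section can run end to end; the directory name is arbitrary.

```python
import os

synthetic_dir = os.path.join(home, "cpt_synthetic_txt")  # arbitrary location
os.makedirs(synthetic_dir, exist_ok=True)

# Write a few small synthetic documents.
for i in range(10):
    with open(os.path.join(synthetic_dir, f"doc_{i}.txt"), "w") as f:
        f.write(f"This is synthetic document number {i}. " * 200)

# Point the API arguments at the synthetic data for this walkthrough.
FT_API_args.train_data_path = synthetic_dir
```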
Ingestion, tokenization and materialization
Continued pretraining accepts a folder of `.txt` files as input data. It tokenizes the text fields and materializes them as a streaming dataset in MDS format.

Mosaic AI Model Training uses `llmfoundry/scripts/data_prep/convert_text_to_mds.py` to download all the `.txt` files and convert them to MDS.
This notebook provides two additional approaches, using Spark and Dask. Continued pretraining datasets are normally much larger than instruction fine-tuning datasets, so tokenization and materialization can be very time consuming.
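The sketch below illustrates the idea behind `convert_text_to_mds.py`: tokenize `.txt` files, concatenate tokens into fixed-length samples, and write them with the `streaming` library's `MDSWriter`. It is a simplified illustration, not the llm-foundry script or the notebook's Spark/Dask variants.

```python
import os
import numpy as np
from streaming import MDSWriter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(FT_API_args.model)
context_length = FT_API_args.context_length

# Token IDs are serialized as raw bytes, one fixed-length sample per MDS record.
columns = {"tokens": "bytes"}
buffer = []

with MDSWriter(out=temporary_mds_output_path, columns=columns, compression="zstd") as writer:
    for fname in sorted(os.listdir(FT_API_args.train_data_path)):
        if not fname.endswith(".txt"):
            continue
        with open(os.path.join(FT_API_args.train_data_path, fname)) as f:
            buffer.extend(tokenizer(f.read())["input_ids"])
        # Emit full context_length-sized samples; leftover tokens wait for more text.
        while len(buffer) >= context_length:
            sample = np.asarray(buffer[:context_length], dtype=np.int32)
            writer.write({"tokens": sample.tobytes()})
            buffer = buffer[context_length:]
```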