Validate data for fine-tuning runs
This notebook shows how to validate your data and ensure data integrity is upheld while using Mosaic AI Model Training. It also provides guidance on how to estimate costs based on token usage during fine-tuning runs.
This script serves as an ad-hoc utility for you to run independently prior to starting fine-tuning. Its primary function is to validate your data before you invoke the Finetuning API. This script is not meant for use during the training process.
The inputs to this validation script are assumed to be the same as, or a subset of, the inputs accepted by the Mosaic AI Model Training API, such as the following:
Install libraries
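The exact dependency set can vary by environment; the following is a minimal sketch that assumes `llm-foundry` and `datasets` cover the validation steps in this notebook.

```python
# Assumed package set for this validation notebook; pin versions to match
# your workspace requirements.
%pip install llm-foundry datasets

# Restart Python so newly installed packages are picked up (Databricks notebooks).
dbutils.library.restartPython()
```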
Instruction fine-tuning
In this section, you set up the parameters for the validation notebook.
The following are the API arguments for fine-tuning tasks:
| Argument | Description |
|---|---|
| `model` | Specifies the model to be used for fine-tuning. For example, `EleutherAI/gpt-neox-20b`. |
| `train_data_path` | The path to the training data. It can be a Hugging Face dataset, a path to a JSON Lines (`.jsonl`) file, or a Delta table. |
| `task_type` | Defines the type of task for which the training strategy is applied. It is either `INSTRUCTION_FINETUNE` or `CONTINUED_PRETRAIN`. |
| `training_duration` | The duration of the training process, expressed numerically in units of training epochs. |
| `context_length` | Specifies the context length of the model, set to 2048. This determines how many tokens the model considers for each training example. |
The following are temporary data path configuration arguments:
- `temporary_jsonl_data_path`: Defines a file system path where temporary data related to the training process is stored. You need to make sure the path is not shared by other users on the cluster, because sharing could cause problems.
- Environment variables for Hugging Face caches (`HF_DATASETS_CACHE`) are set to `/tmp/`, directing dataset caching to a temporary directory.
You need to specify the context length based on the model. For the latest supported models and their associated context lengths, use the `get_models()` function. See the supported models (AWS | Azure).
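A minimal sketch of listing the supported models. The import path below is an assumption (it presumes the `databricks_genai` SDK exposes `get_models()` under `databricks.model_training.foundation_model`); verify the exact module path for your installed SDK version.

```python
# Assumption: the foundation model training SDK (databricks_genai) is installed
# and exposes get_models(); verify the import path for your SDK version.
from databricks.model_training import foundation_model as fm

# Lists the supported base models and their associated context lengths.
models = fm.get_models()
print(models)
```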
The following sets up your `home` directory:
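A minimal sketch, assuming a Databricks notebook where `spark` is available; the convention for deriving a per-user path is illustrative.

```python
import os

# Derive a per-user home directory so that temporary paths used later in this
# notebook are not shared with other users on the cluster.
username = spark.sql("SELECT current_user()").first()[0]
home = os.path.join("/tmp", username)
os.makedirs(home, exist_ok=True)
print(f"Using home directory: {home}")
```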
The following defines the fine-tuning API arguments and uses `temporary_jsonl_data_path` to define the file system path where you store temporary data related to the training process. Environment variables for Hugging Face caches (`HF_DATASETS_CACHE`) are set to `/tmp/`, which directs dataset caching to a temporary directory.
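A sketch of the instruction fine-tuning arguments described above. The model name, training data path, and temporary path are placeholders; substitute your own values.

```python
import os
from argparse import Namespace

# Placeholder path; `home` comes from the setup cell above. Keep it
# user-specific so it is not shared with other users on the cluster.
temporary_jsonl_data_path = os.path.join(home, "ift_data")

FT_API_args = Namespace(
    model="EleutherAI/gpt-neox-20b",
    train_data_path="main.default.my_training_table",  # Delta table, .jsonl path, or HF dataset ID (placeholder)
    task_type="INSTRUCTION_FINETUNE",
    training_duration=3,       # number of training epochs
    context_length=2048,
)

# Direct Hugging Face dataset caching to a temporary directory.
os.environ["HF_DATASETS_CACHE"] = "/tmp/"
os.makedirs(temporary_jsonl_data_path, exist_ok=True)
```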
Data loading
The instruction fine-tuning data needs to have the dictionary format below:

- `prompt`: xxx
- `response` or `completion`: yyy
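For example, each line of a `.jsonl` training file corresponds to one such dictionary (values are illustrative):

```python
# One training example per line in the .jsonl file; values are illustrative.
example = {
    "prompt": "What is Mosaic AI Model Training?",
    "response": "A managed service for fine-tuning foundation models on Databricks.",
}
```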
Based on `FT_API_args.train_data_path`, select an ingestion method from one of the following options (a sketch follows the list):

- A JSONL file stored in an object store supported by Composer.
- A Hugging Face dataset ID. For this option, you also need to provide a split.
- A Delta table.
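The following sketch shows one way to branch on `FT_API_args.train_data_path`. The helper name and the rule for detecting a Delta table (a three-level `catalog.schema.table` name) are illustrative, not the notebook's exact logic.

```python
import json
from datasets import load_dataset

def load_raw_dataset(FT_API_args, hf_split="train"):
    """Illustrative loader covering the three ingestion options above."""
    path = FT_API_args.train_data_path
    if path.endswith(".jsonl"):
        # JSONL file (a local copy or a locally mounted object store path).
        with open(path) as f:
            return [json.loads(line) for line in f]
    elif len(path.split(".")) == 3:
        # Assume a three-level namespace is a Delta table: catalog.schema.table.
        # Collecting to the driver is acceptable for validation-sized data.
        return [row.asDict() for row in spark.table(path).collect()]
    else:
        # Otherwise treat it as a Hugging Face dataset ID; a split must be provided.
        return list(load_dataset(path, split=hf_split))

raw_dataset = load_raw_dataset(FT_API_args)
```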
Data quality checks on the dataset
This section of the notebook performs a series of checks on the initial dataset to ensure its quality and expected format. The checks verify that the dataset adheres to the expected structure and contains the necessary keys for further processing, as outlined below.
- The total number of examples in the dataset is printed.
- The first example from the dataset is displayed. This provides a quick glimpse into the data structure and format.
- Data format validation:
- The dataset is expected to consist of dictionary-like objects (for example, key-value pairs).
- A check is performed to validate this structure.
- Key presence validation:
- The allowed prompt keys, response keys, and chat roles are defined in llmfoundry as `_ALLOWED_PROMPT_KEYS`, `_ALLOWED_RESPONSE_KEYS`, and `_ALLOWED_ROLES`.
- For a prompt-response dataset, the script checks for the presence of at least one prompt key and one response key in each example.
- Prompt validation: Each example is checked for the presence of keys defined in `_ALLOWED_PROMPT_KEYS`. If no valid prompt key is found, it is counted as a format error.
- Response validation: Similarly, each example is checked for the presence of keys defined in `_ALLOWED_RESPONSE_KEYS`. The absence of a valid response key is also counted as a format error.
- For a chat-formatted dataset, the script checks whether the message content is valid by calling the `_validate_chat_formatted_example` helper function.
If any format errors are found during the checks, they are reported. A summary of errors is printed, categorizing them into types like `data_type` (non-dictionary data), `missing_prompt`, and `missing_response`.

If no errors are found, a congratulatory message is displayed, indicating that all checks have passed successfully.
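A minimal sketch of the prompt-response checks. The key sets below are defined locally to mirror llmfoundry's `_ALLOWED_PROMPT_KEYS` and `_ALLOWED_RESPONSE_KEYS`; in practice, import the constants from llmfoundry so they stay in sync.

```python
from collections import defaultdict

# Local stand-ins for llmfoundry's _ALLOWED_PROMPT_KEYS / _ALLOWED_RESPONSE_KEYS.
ALLOWED_PROMPT_KEYS = {"prompt"}
ALLOWED_RESPONSE_KEYS = {"response", "completion"}

print(f"Num examples: {len(raw_dataset)}")
print("First example:", raw_dataset[0])

format_errors = defaultdict(int)
for example in raw_dataset:
    if not isinstance(example, dict):
        format_errors["data_type"] += 1         # non-dictionary data
        continue
    if not any(k in example for k in ALLOWED_PROMPT_KEYS):
        format_errors["missing_prompt"] += 1    # no valid prompt key
    if not any(k in example for k in ALLOWED_RESPONSE_KEYS):
        format_errors["missing_response"] += 1  # no valid response key

if format_errors:
    print("Found errors:")
    for error_type, count in format_errors.items():
        print(f"  {error_type}: {count}")
else:
    print("Congratulations! All checks passed.")
```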
Token estimation
Tokenize the raw dataset and gather statistics on the tokens. By doing this, you can estimate the overall cost based on a default training run duration. You iterate over the dataloader and sum the number of tokens from each batch.
The fine-tuning API internally ingests the dataset and runs tokenization with the selected tokenizer. The output dataset is a collection of samples and each sample is a collection of token IDs represented as integers.
A histogram is generated so you can visualize the distribution of the frequency of token counts in samples in the dataset. The visualization aids in identifying patterns, outliers, and central tendencies in the token distribution.
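A rough sketch of the token estimate and histogram. The fine-tuning API's own tokenization (prompt templating, packing, padding) can differ from this, so treat the numbers as an approximation.

```python
import matplotlib.pyplot as plt
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(FT_API_args.model)

token_counts = []
for example in raw_dataset:
    text = example["prompt"] + example.get("response", example.get("completion", ""))
    n_tokens = len(tokenizer(text)["input_ids"])
    # Samples longer than the context length are truncated during training.
    token_counts.append(min(n_tokens, FT_API_args.context_length))

tokens_per_epoch = sum(token_counts)
print(f"Tokens per epoch: {tokens_per_epoch}")
print(f"Estimated tokens for the run: {tokens_per_epoch * FT_API_args.training_duration}")

# Histogram of token counts per sample to spot outliers and central tendencies.
plt.hist(token_counts, bins=50)
plt.xlabel("Tokens per sample")
plt.ylabel("Frequency")
plt.title("Distribution of token counts")
plt.show()
```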
Continued pretraining
Similar to instruction fine-tuning, you need to specify the following arguments:
| Argument | Description |
|---|---|
| `model` | Specifies the model to be used for fine-tuning. For example, `EleutherAI/gpt-neox-20b`. |
| `train_data_path` | The path to the training data. Currently, only a remote or local path to a collection of `.txt` files is supported. |
| `task_type` | Defines the type of task for which the training strategy is applied. It is either `INSTRUCTION_FINETUNE` or `CONTINUED_PRETRAIN`. |
| `training_duration` | The duration of the training process, expressed numerically in units of training epochs. |
| `context_length` | Specifies the context length of the model, set to 2048. This determines how many tokens the model considers for each training example. For continued pretraining, tokens are concatenated to form samples of length equal to `context_length`. |
The following are temporary data path configuration arguments:
- `temporary_mds_output_path`: Defines a file system path where temporary data generated by the running notebook is stored. You need to make sure the path isn't shared by other users on the cluster, because sharing could cause problems. For example, you can make it distinguishable by adding your username to the path.
The following defines the fine-tuning API arguments and uses `temporary_mds_output_path` to define the file system path where temporary data related to the training process is stored.
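A sketch of the continued pretraining arguments; the paths are placeholders, and deriving the MDS output path from the per-user `home` directory (from the setup cell earlier) keeps it distinct per user.

```python
import os
from argparse import Namespace

# Placeholder output path; `home` comes from the setup cell earlier in the notebook.
temporary_mds_output_path = os.path.join(home, "mds_output")

FT_API_args = Namespace(
    model="EleutherAI/gpt-neox-20b",
    train_data_path="/Volumes/main/default/my_volume/cpt_txt_files",  # folder of .txt files (placeholder)
    task_type="CONTINUED_PRETRAIN",
    training_duration=3,       # number of training epochs
    context_length=2048,
)
```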
Generate a synthetic dataset. In practice, replace `train_data_path` with your raw data path.
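A sketch that writes a handful of synthetic `.txt` files so the rest of the section can run end to end; the directory name is arbitrary.

```python
import os

synthetic_dir = os.path.join(home, "cpt_synthetic_txt")  # arbitrary location
os.makedirs(synthetic_dir, exist_ok=True)

# Write a few small synthetic documents.
for i in range(10):
    with open(os.path.join(synthetic_dir, f"doc_{i}.txt"), "w") as f:
        f.write(f"This is synthetic document number {i}. " * 200)

# Point the API arguments at the synthetic data for this walkthrough.
FT_API_args.train_data_path = synthetic_dir
```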
Ingestion, tokenization and materialization
Continued pretraining accepts a folder of `.txt` files as input data. It tokenizes the text fields and materializes them as a streaming dataset in MDS format.

Mosaic AI Model Training uses `llmfoundry/scripts/data_prep/convert_text_to_mds.py` to download all the `.txt` files and convert them to MDS.
This notebook provides two additional approaches, using Spark and Dask. Continued pretraining datasets are normally much larger than instruction fine-tuning datasets, so tokenization and materialization can be very time consuming.
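The sketch below illustrates the idea behind `convert_text_to_mds.py`: tokenize `.txt` files, concatenate tokens into fixed-length samples, and write them with the `streaming` library's `MDSWriter`. It is a simplified illustration, not the llm-foundry script or the notebook's Spark/Dask variants.

```python
import os
import numpy as np
from streaming import MDSWriter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(FT_API_args.model)
context_length = FT_API_args.context_length

# Token IDs are serialized as raw bytes, one fixed-length sample per MDS record.
columns = {"tokens": "bytes"}
buffer = []

with MDSWriter(out=temporary_mds_output_path, columns=columns, compression="zstd") as writer:
    for fname in sorted(os.listdir(FT_API_args.train_data_path)):
        if not fname.endswith(".txt"):
            continue
        with open(os.path.join(FT_API_args.train_data_path, fname)) as f:
            buffer.extend(tokenizer(f.read())["input_ids"])
        # Emit full context_length-sized samples; leftover tokens wait for more text.
        while len(buffer) >= context_length:
            sample = np.asarray(buffer[:context_length], dtype=np.int32)
            writer.write({"tokens": sample.tobytes()})
            buffer = buffer[context_length:]
```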