Large language models (LLMs)

Preview

These notebooks fine-tune large language models (LLMs) on AI Runtime. They cover parameter-efficient methods like Low-Rank Adaptation (LoRA) and full supervised fine-tuning across libraries including TRL, Unsloth, Axolotl, and LLM Foundry, with models from Qwen2 and Llama to GPT-OSS 120B.
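To make "parameter-efficient" concrete, the sketch below compares the trainable parameter count of one LoRA adapter with full fine-tuning of the same weight matrix. It uses the standard LoRA formulation, where a frozen `d_out x d_in` weight is augmented by two low-rank factors of rank `r`. The layer shape and rank are hypothetical values chosen for illustration and are not taken from any of the notebooks.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter.

    The adapter factors the weight update as B @ A, where
    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * d_in + d_out * rank


def full_trainable_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole weight matrix is trained."""
    return d_in * d_out


# Hypothetical projection layer and rank (assumptions for illustration only).
d_in, d_out, rank = 4096, 4096, 16

lora = lora_trainable_params(d_in, d_out, rank)
full = full_trainable_params(d_in, d_out)
fraction = lora / full

print(f"LoRA: {lora:,} trainable parameters")
print(f"Full: {full:,} trainable parameters")
print(f"LoRA trains {fraction:.2%} of this layer's weights")
```

For this hypothetical layer the adapter trains under 1% of the weights, which is why LoRA notebooks can fit on fewer or smaller GPUs than full-weight fine-tuning of the same model.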

Tutorial	Description
Fine-tune Qwen3-4B model	Full-weight fine-tune the Qwen3-4B model on a single H100 GPU using Transformer Reinforcement Learning (TRL), with BF16 mixed precision and gradient checkpointing for memory-efficient training.
Fine-tune Llama-3.2-3B with Unsloth	Fine-tune Llama-3.2-3B using the Unsloth library.
Fine-tune GPT-OSS 20B	Fine-tune OpenAI's `gpt-oss-20b` model on 8 H100 GPUs using distributed data parallelism and LoRA for parameter-efficient fine-tuning.
Supervised fine-tuning using DeepSpeed and TRL	Use the Serverless GPU Python API to run supervised fine-tuning (SFT) using the TRL library with DeepSpeed ZeRO Stage 3 optimization.
LoRA fine-tuning using Axolotl	Use the Serverless GPU Python API to LoRA fine-tune an Olmo3 7B model using the Axolotl library.
Distributed fine-tune Qwen2-0.5B	Fine-tune the Qwen2-0.5B model with distributed training, using LoRA to reduce the number of trainable parameters and Liger Kernels to reduce memory use.
Distributed fine-tune Llama-3.2-3B with Unsloth	Fine-tune Llama-3.2-3B using distributed training across multiple GPUs with the Unsloth library for optimized parameter-efficient training.
Fine-tune Llama 3.1 8B with LLM Foundry	Fine-tune the Llama 3.1 8B model using Mosaic LLM Foundry with distributed training strategies and model evaluation.
Fine-tune GPT-OSS 120B with DDP and FSDP	Fine-tune OpenAI's GPT-OSS 120B model using supervised fine-tuning on H100 GPUs with Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP) training strategies.
Distributed training with PyTorch FSDP	Train Transformer models using PyTorch Fully Sharded Data Parallel (FSDP) to shard model parameters across multiple GPUs.
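As a rough illustration of why the larger models in this list use sharded strategies such as FSDP, the sketch below estimates the per-GPU memory needed for model weights alone when parameters are split evenly across GPUs. It is back-of-the-envelope arithmetic, not a measurement: the 120 billion parameter count is a hypothetical round number, BF16 storage (2 bytes per parameter) is assumed, and gradients, optimizer states, and activations are ignored.

```python
BYTES_PER_BF16_PARAM = 2  # assumption: weights stored in BF16
GIB = 2**30


def sharded_weight_gib(total_params: int, num_gpus: int,
                       bytes_per_param: int = BYTES_PER_BF16_PARAM) -> float:
    """Estimate per-GPU weight memory (GiB) with parameters sharded evenly."""
    # Integer ceiling division: the last shard may be padded.
    params_per_gpu = -(-total_params // num_gpus)
    return params_per_gpu * bytes_per_param / GIB


# Hypothetical model size (assumption for illustration only).
total_params = 120_000_000_000

unsharded = sharded_weight_gib(total_params, num_gpus=1)
sharded = sharded_weight_gib(total_params, num_gpus=8)

print(f"Weights on one GPU:        {unsharded:.1f} GiB")
print(f"Weights per GPU, 8 shards: {sharded:.1f} GiB")
```

Under these assumptions the unsharded weights come to roughly 224 GiB, while each of 8 shards holds about 28 GiB. Data parallelism alone (DDP) keeps a full copy of the weights on every GPU, so it does not reduce this per-GPU figure.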

Video demo

This video walks through the Fine-tune Llama-3.2-3B with Unsloth example notebook in detail (12 minutes).
