September 12, 2025

Understanding Supervised Finetuning (SFT) with GPT in 2025

Complete guide with PyTorch implementation.

Note: Please read this Colab Notebook in conjunction with this blog post to get a clear understanding of how to implement SFT with GPT-2 from Hugging Face.

What is Supervised Fine-Tuning (SFT)?

Supervised Fine-Tuning is the foundation of virtually every LLM assistant we use today.

But, what does SFT mean in the context of modern AI development, and why has supervised fine-tuning become the cornerstone of creating intelligent language models like ChatGPT and Claude? 

Supervised Fine-Tuning (SFT) is the process of taking a pre-trained language model and further training it on a smaller, task-specific dataset with labeled examples. Its goal is to adjust the weights of the pre-trained model so that it performs better on a specific task without losing the general knowledge acquired during pre-training. This technique has become essential because it transforms raw language models into helpful AI assistants.

The "supervised" aspect means we’re teaching the model using examples of correct behavior. Think of it like tutoring a brilliant student who knows many facts but needs guidance on how to apply that knowledge properly. 

For example, if you want an LLM to classify emails as "spam" or "not spam," you would provide it with a dataset containing email texts along with their corresponding labels. The model then learns to map input sequences to the correct outputs. The training data follows a specific format:

  • Prompt: "Write a poem to help me remember the first 10 elements on the periodic table"
  • Response: [High-quality example response with proper formatting and accuracy].  

SFT differs significantly from other training methods. 

While SFT is a type of fine-tuning, not all fine-tuning is “supervised.” 

Figure: Step 1 in this image shows SFT. (Source: Aligning language models to follow instructions)

Here's how SFT differs from broader fine-tuning approaches: it uses labeled input-output pairs rather than unlabeled data or reward signals. The standard AI training pipeline follows three stages:

  1. Pre-training
  2. SFT
  3. RLHF (Reinforcement Learning from Human Feedback)

How Does SFT Work?

SFT transforms raw language models into helpful assistants through a straightforward process. The SFT process typically follows these steps:

  1. Dataset curation
  2. Fine-tuning
  3. Evaluation

The core difference from pre-training lies in the data used. While pre-training uses massive unlabeled text, SFT employs carefully curated instruction-response pairs.

SFT is not much different from language model pretraining. Both pretraining and SFT use next token prediction as their underlying training objective. However, typically, the next token prediction objective is only applied to the portion of each example that corresponds to the LLM's output. This ensures that the model learns to generate appropriate responses rather than memorizing prompts.
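To make this concrete, here is a minimal sketch of how the loss can be restricted to response tokens. In PyTorch and Hugging Face, label positions set to -100 are ignored by the cross-entropy loss; the prompt and response strings below are purely illustrative.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Illustrative prompt/response pair
prompt = "Classify this email as spam or not spam: 'You won a free prize!'\nAnswer:"
response = " spam"

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
response_ids = tokenizer(response, return_tensors="pt").input_ids

# The model sees prompt + response as one sequence
input_ids = torch.cat([prompt_ids, response_ids], dim=1)

# Labels mirror the inputs, but prompt positions are set to -100 so the
# causal LM loss only counts the response tokens
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100
```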

Key Components of SFT

The foundation starts with base model selection. We conduct experiments using GPT-2 from Hugging Face.

Dataset structure proves crucial. A smaller dataset relevant to the target task is created. This dataset consists of input-output pairs where each input is associated with a label or response.

Training configuration significantly impacts results. Larger batch sizes, combined with lower learning rates, improve generalization and performance.

SFT Training Pipeline

Data preprocessing involves tokenizing and formatting instruction-response pairs. During training, the model's parameters are updated to minimize the difference between its predictions and the true labels. After fine-tuning, the model is evaluated on a validation set to assess its performance on the target task.

Benefits and Limitations

SFT offers remarkable computational efficiency and is simple to use: the training process and objective are very similar to pretraining. The approach is also highly effective at alignment and, relative to pretraining, computationally cheap (roughly 100× less expensive, if not more).

However, the results of SFT are heavily dependent upon the dataset we curate, requiring careful manual inspection to ensure quality and diversity.

Setting Up SFT with Hugging Face and PyTorch

Environment Setup

Getting started with SFT requires specific libraries and imports. The essential setup includes:

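The original `import.py` snippet is not reproduced here; a minimal sketch of the typical imports for this Hugging Face and PyTorch workflow looks like:

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
```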

Understanding the Hugging Face Trainer

Hugging Face's Trainer class streamlines the entire fine-tuning process.

Key parameters include the learning rate (typically 2e-5 for stable performance), batch size, and number of training epochs. These can be defined in a TrainingArguments object and passed to the Trainer class, as in the example below.

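A hedged sketch of what a `trainer.py` setup can look like. It assumes the model, tokenizer, and `lm_dataset` defined in the sections that follow; the output directory and exact hyperparameter values are illustrative.

```python
training_args = TrainingArguments(
    output_dir="gpt2-sft",            # arbitrary checkpoint directory
    learning_rate=2e-5,               # the rate discussed above
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=100,
)

trainer = Trainer(
    model=model,                      # loaded in the model-selection step below
    args=training_args,
    train_dataset=lm_dataset["train"],        # built in the data-preparation step
    eval_dataset=lm_dataset["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```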

The Trainer handles batching, loss computation, and optimization automatically, eliminating complex manual implementation.

Data Preparation

Dataset preparation requires careful cleaning and formatting. Here's the standard approach for loading and preprocessing data:

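A minimal sketch of the loading step, using the WikiText dataset this post trains on; the exact config name (`wikitext-2-raw-v1`) is an assumption.

```python
from datasets import load_dataset

# WikiText, the dataset used later in this post; config name is an assumption
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

# Remove empty or whitespace-only entries before training
dataset = dataset.filter(lambda example: len(example["text"].strip()) > 0)
```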

This preprocessing ensures clean text data without empty entries, essential for effective training.

Model Selection and Loading

Choose base models strategically and load them with proper precision settings:

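A sketch of loading GPT-2 with the memory optimizations discussed below; the helper calls are standard Hugging Face APIs, but the choices shown are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # the base model used in this post

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

model = AutoModelForCausalLM.from_pretrained(model_name)
model.gradient_checkpointing_enable()  # trades extra compute for lower VRAM usage
```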

Memory optimization techniques like gradient checkpointing and mixed precision training reduce VRAM requirements significantly, enabling larger batch sizes on limited hardware while maintaining training stability.

Hands-On Implementation (Fine-Tuning GPT-2 with PyTorch)

Setting Up Your Fine-Tuning Environment

How do you actually implement GPT fine-tuning using PyTorch and Hugging Face, and what are the critical steps that determine training success or failure?

The implementation follows a structured approach that combines environment preparation, data processing, and model configuration. Successful GPT-2 fine-tuning projects typically achieve significant performance improvements over base models when properly configured. The PyTorch implementation requires careful attention to library versions, data formatting, and hardware optimization to ensure stable training runs.

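A minimal sketch of the cleaning and filtering steps described below; the function name `clean_text` is illustrative, and it assumes the `dataset` loaded earlier.

```python
def clean_text(example):
    # Strip whitespace artifacts that show up in raw WikiText entries
    example["text"] = example["text"].strip()
    return example

dataset = dataset.map(clean_text)

# Drop zero-length sequences that waste compute and destabilize training
dataset = dataset.filter(lambda example: len(example["text"]) > 0)
```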

This data preparation approach addresses common preprocessing challenges:

  1. The cleaning function removes problematic empty entries that cause training instability.
  2. Filtering operations eliminate zero-length sequences that waste computational resources.

Alternative approaches include custom tokenization schemes and specialized formatting for instruction-following tasks, but this baseline method works reliably across different model architectures and ensures consistent training performance.

Model Configuration and Training Setup

What model configuration and training parameters are essential for successful GPT fine-tuning, and how do you optimize them for your specific use case?

Model selection and configuration determine training efficiency and final performance. The key is balancing model capacity with available computational resources while ensuring stable training dynamics. Fine-tuning AI requires strategic decisions about precision settings, memory allocation, and optimization techniques.

Successful model optimization typically reduces training time by 40-60% through proper configuration while maintaining or improving model quality.

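A sketch of the tokenization and packing step, following the standard Hugging Face causal-LM recipe; the `block_size` of 512 is an assumption (GPT-2 supports contexts up to 1024 tokens).

```python
block_size = 512  # assumed training context length

tokenizer.padding_side = "right"  # right-side padding for consistent batches

def tokenize_function(examples):
    return tokenizer(examples["text"])

tokenized = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

def group_texts(examples):
    # Concatenate all token sequences, then split them into fixed-size blocks
    # so that almost no padding is wasted
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = tokenized.map(group_texts, batched=True)
```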

This tokenization approach optimizes memory usage by efficiently packing text. The group_texts function concatenates sequences to minimize padding waste, improving training efficiency. Right-side padding ensures consistent batch processing across different sequence lengths.

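A sketch of training arguments with the memory and stability settings discussed next; the specific values are illustrative, not tuned recommendations.

```python
import torch
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gpt2-sft-wikitext",      # arbitrary checkpoint directory
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,       # effective batch size = 8 * 4 = 32
    num_train_epochs=3,
    fp16=torch.cuda.is_available(),      # mixed precision on supported GPUs (or bf16)
    max_grad_norm=1.0,                   # gradient clipping for training stability
    logging_steps=100,
)
```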

These parameters reflect research-backed optimization strategies. Mixed precision training (fp16/bf16) reduces memory usage while maintaining numerical stability.

Gradient clipping prevents training instability, which is essential for consistent convergence across different model architectures and datasets.

Best Practices and Production Considerations for SFT in 2025

What are the critical best practices for deploying supervised fine-tuning in production environments, and how do you ensure your fine-tuned models are safe, effective, and cost-efficient in 2025?

The foundation of successful AI fine-tuning lies in prioritizing data quality over quantity. Research from LIMA demonstrates that 1,000 carefully curated examples outperform 10,000 poorly selected ones. In LIMA, authors curate a dataset of only 1K examples for SFT, and the resulting model is quite competitive with top open-source and proprietary LLMs. Production-ready models require datasets with high quality and diversity standards rather than massive volumes of mediocre training data.

Training optimization demands a systematic approach to hyperparameter selection and resource management. Larger batch sizes, combined with lower learning rates, improve generalization and performance. Key optimization practices include:

  1. Learning Rate Scheduling: Start with 2e-5 for stable convergence across model architectures.
  2. Batch Size Optimization: Use gradient accumulation to achieve effective batch sizes of 3,840-7,680 samples.
  3. Early Stopping Criteria: Monitor validation loss plateaus to prevent overfitting and reduce costs (see the sketch after this list).
  4. Cost Monitoring: Track GPU hours and implement automatic training termination for budget control.
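A hedged sketch of how early stopping and large effective batches can be wired up with the transformers library's EarlyStoppingCallback; it assumes the model and `lm_dataset` from earlier sections, and the directory name and values are illustrative.

```python
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="gpt2-sft-best-practices",  # arbitrary name
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,        # scale toward large effective batches
    eval_strategy="epoch",                 # called evaluation_strategy in older versions
    save_strategy="epoch",
    load_best_model_at_end=True,           # required for early stopping
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["validation"],
    # Stop if validation loss fails to improve for two consecutive evaluations
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
```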

Future directions suggest hybrid approaches that combine SFT with RLHF for optimal alignment.

Closing

You can find the code for this blog here.

Training an older open-source model is a great experience. It makes you realize how far AI has come, and it offers teachable moments about how these models actually work. With an open-source model you don't have to build anything from scratch; instead, you get to focus on the engineering side of training.

GPT-2 training and validation graph. Our SFT results from training GPT-2 on the WikiText dataset.

If you get familiar with how to train the model, you can essentially incorporate similar techniques into other models and datasets.