
OpenAI released gpt-oss, an open-weight model, after nearly six years; its last open release was GPT-2 in 2019. But there is a difference between open-weight and open-source models.
Open-source gives you the full recipe.
Model code, training pipeline, and sometimes even datasets. Everything is publicly available. This is where transparency lives. It’s reproducible, auditable, and flexible. You can retrain it, inspect every decision, and stay compliant. But it also asks more of you: infrastructure, expertise, maintenance. And in return, it removes ceilings.
Think: Mistral 7B.
Open-weight sits in the middle.
You get the trained model, not how it was made. It’s fast to deploy and cheaper than training from scratch. But it’s opaque. You can fine-tune, but not re-trace. You get a head start, but not full control.
This is where gpt-oss-20B and gpt-oss-120B come into the picture.
You can fine-tune these models with any dataset you want. This lets you customize the model within your own GPU budget and, most importantly, without paying fine-tuning fees to API providers.
But fine-tuning gpt-oss is not that simple. The main reason is that the model is too big to fit on most GPUs. OpenAI specifically points to H100-class GPUs; I tried loading the full-precision weights on an A100 GPU, and it wouldn't fit.
But there are workarounds. And the one I am particularly interested in is Unsloth.ai.
Unsloth.ai allows you to instruct-tune or fine-tune an open-weight model for free in Google Colab and Kaggle notebooks. It is a game-changer.
You can essentially test how the model works and performs on your dataset for free, and then scale up before putting it into production.
In this tutorial, I will show you how to instruct-tune or fine-tune gpt-oss
with an A100 GPU.

Once the notebook starts, we can run !nvidia-smi to see the GPU name and details.
To get started with Unsloth, install their library first. You also need to make sure you are on a recent version of PyTorch. Once the installation is complete, restart the notebook to avoid errors in later cells.
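A minimal install cell for a Colab-style notebook might look like the following; the exact packages and version pins Unsloth recommends change over time, so treat this as a sketch and check their docs:

```python
# Install Unsloth (pulls in its fine-tuning dependencies) and refresh PyTorch.
# Exact version pins are an assumption; follow Unsloth's current instructions.
!pip install --upgrade unsloth
!pip install --upgrade torch
# Restart the runtime after this cell so the new versions are picked up.
```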
Importing Libraries
Unsloth helps load models efficiently. PyTorch handles the AI computations.
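In code, that is just two imports (FastLanguageModel is the loader Unsloth's own examples use):

```python
from unsloth import FastLanguageModel  # efficient model loading + LoRA helpers
import torch                           # tensor computations on the GPU
```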
Setting Parameters
max_seq_length controls how many tokens the model can process at once. Think of it as the model's attention span; 2048 tokens is roughly 1,500 words.
dtype = None lets the system pick the best number format automatically.
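As a cell, these two settings look like this:

```python
max_seq_length = 2048  # how many tokens one training example may contain
dtype = None           # None = let Unsloth pick the best precision for this GPU
```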
Loading the Model
The main loading function loads a pre-trained gpt-oss model with specific settings:
- model_name: Specifies which model to download. This one has 20 billion parameters.
- load_in_4bit: Compresses the model to use less memory (like zipping a file).
- full_finetuning = False: Uses a memory-efficient training method called LoRA instead of updating all parameters.
This setup is perfect for running large AI models on computers with limited GPU memory. The 4-bit loading trick makes a 20B parameter model fit where it normally wouldn't.
As I mentioned earlier, the full-precision gpt-oss weights won't fit on an A100 GPU; OpenAI points you to an H100 for that. With 4-bit loading, the A100 is enough.
The code returns both the model and tokenizer.
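Putting it together, a loading cell could look like this; the checkpoint name unsloth/gpt-oss-20b is my assumption for the 4-bit-ready 20B weights, so double-check the exact ID on the Unsloth model hub:

```python
from unsloth import FastLanguageModel  # repeated so this cell stands alone

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",  # assumed checkpoint ID
    max_seq_length = max_seq_length,     # 2048, set above
    dtype = dtype,                       # auto precision, set above
    load_in_4bit = True,                 # quantize so the 20B model fits in memory
    full_finetuning = False,             # we attach LoRA adapters instead
)
```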
Implementing LoRA
LoRA Configuration
r = 8
This sets the "rank": how complex the training updates can be. Higher numbers mean more detailed changes but use more memory.
Target Modules
The target_modules list specifies which parts of the model get trained:
q_proj, k_proj, v_proj, o_proj: Components of the attention mechanism.
gate_proj, up_proj, down_proj: These are parts of the feed-forward layers.
Think of these as specific "knobs" in the AI brain that we're allowed to adjust.
Fine-tuning Parameters
lora_alpha = 16: Controls how strong the training updates are.
lora_dropout = 0: Prevents overfitting (set to 0 here for speed).
bias = "none": Skips updating bias terms for efficiency.
Memory Optimization
use_gradient_checkpointing = "unsloth"
This saves memory during training by recomputing some calculations.
The random_state = 7 ensures reproducible results.
LoRA lets you train huge models efficiently by only updating small adapter layers instead of the entire model.
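Here is how those settings come together in Unsloth's get_peft_model call, a sketch based on the parameters described above:

```python
model = FastLanguageModel.get_peft_model(
    model,
    r = 8,                                   # rank of the LoRA update matrices
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,                         # scaling of the LoRA updates
    lora_dropout = 0,                        # no dropout, for speed
    bias = "none",                           # don't train bias terms
    use_gradient_checkpointing = "unsloth",  # recompute activations to save memory
    random_state = 7,                        # reproducible adapter initialization
)
```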
Dataset
Setting Up the Conversation
conversation = [{"role": "user", "content": "Solve 3x = 27"}]
Creates a chat format with a math question. The model expects conversations structured like a real chat.
Converting to Model Format
The apply_chat_template function transforms human-readable text into numbers:
add_generation_prompt: Tells the model it should respond.
return_tensors="pt": Converts to PyTorch format (the model's language).
reasoning_effort="low": Controls how much step-by-step thinking to show.
Device Matching
model_input = model_input.to(model.device)
Moves the input to the same location as the model (GPU or CPU).
Generating the Response
The TextStreamer shows tokens appearing in real time, like ChatGPT.
Generation parameters:
max_new_tokens: Maximum response length.
temperature=1.0: Normal creativity level (higher = more random).
top_p=1.0: Considers all possible next words.
The model processes the math problem and generates a solution, streaming each word as it's created.
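Assembled into one cell, this pre-training sanity check might look as follows; return_dict = True is my addition so that the inputs unpack cleanly into generate, and max_new_tokens = 128 mirrors the test later in the tutorial:

```python
from transformers import TextStreamer

conversation = [{"role": "user", "content": "Solve 3x = 27"}]

model_input = tokenizer.apply_chat_template(
    conversation,
    add_generation_prompt = True,  # tell the model it should answer next
    return_tensors = "pt",         # PyTorch tensors
    return_dict = True,            # assumption: lets us unpack with ** below
    reasoning_effort = "low",      # gpt-oss option: keep the reasoning short
)
model_input = model_input.to(model.device)  # move inputs onto the GPU

_ = model.generate(
    **model_input,
    max_new_tokens = 128,                # cap the response length (assumed value)
    temperature = 1.0,                   # default creativity
    top_p = 1.0,                         # consider the full token distribution
    streamer = TextStreamer(tokenizer),  # print tokens as they are generated
)
```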
The Formatting Function
def formatting_prompts_func(examples):
This function converts raw conversation data into a format the model can learn from.
Processing Conversations
convos = examples["messages"]
Extracts conversation data from the dataset. Each conversation contains multiple messages between user and assistant.
Converting Format
The apply_chat_template call transforms the conversations:
tokenize = False: Keeps text as readable strings instead of numbers.
add_generation_prompt = False: Doesn't add extra prompts since this is training data.
Creating Text List
texts = [tokenizer.apply_chat_template(...) for convo in convos]
Processes each conversation and creates a list of formatted texts.
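A complete version of the function, following the pieces above; the returned "text" column name is an assumption, chosen because it is the column SFTTrainer conventionally reads:

```python
def formatting_prompts_func(examples):
    # Turn each conversation in the batch into one formatted training string.
    convos = examples["messages"]
    texts = [
        tokenizer.apply_chat_template(
            convo,
            tokenize = False,               # keep readable text, not token IDs
            add_generation_prompt = False,  # the answers are already in the data
        )
        for convo in convos
    ]
    return {"text": texts}  # assumed column name the trainer will read
```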
Loading the Dataset
dataset = load_dataset("AI-MO/NuminaMath-CoT", split="train")
Downloads the NuminaMath dataset, which contains math problems with step-by-step solutions.
This preparation step is crucial. Raw datasets aren't in the right format for training. The function standardizes all conversations so the model can learn patterns consistently.
Standardizing the Dataset
dataset = standardize_sharegpt(dataset)
This function fixes common formatting issues in conversation datasets.
ShareGPT format is a popular way to store conversations, but different datasets have slight variations. The standardize function makes everything consistent.
Applying the Formatting Function
dataset = dataset.map(formatting_prompts_func, batched = True)
This applies our formatting function to every conversation in the dataset.
Key details:
map(): Runs the function on each example.
batched = True: Processes multiple conversations at once for speed.
These two lines transform thousands of messy conversations into clean, uniform training examples. The result is a dataset where every conversation follows the exact same format the model expects during training.
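The whole dataset preparation in one place; standardize_sharegpt lives in unsloth.chat_templates, and if your Unsloth version exposes a differently named helper, adjust accordingly:

```python
from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt

dataset = load_dataset("AI-MO/NuminaMath-CoT", split = "train")
dataset = standardize_sharegpt(dataset)                          # normalize the message format
dataset = dataset.map(formatting_prompts_func, batched = True)   # add the formatted text column
```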
Creating the Trainer
trainer = SFTTrainer(model=model, tokenizer=tokenizer, train_dataset=dataset)
SFTTrainer handles the entire training loop. It takes your model, tokenizer, and prepared dataset.
Training Configuration
The SFTConfig contains all training settings:
Batch Settings
per_device_train_batch_size = 1: Processes one conversation at a time (saves memory).
gradient_accumulation_steps = 4: Combines 4 batches before updating (effective batch size = 4).
Training Duration
max_steps = 30: Stops after 30 training steps.
warmup_steps = 5: Gradually increases the learning rate for stability.
Learning Parameters
learning_rate = 2e-4: How fast the model learns (0.0002).
optim = "adamw_8bit": Uses a memory-efficient optimizer.
weight_decay = 0.01: Prevents overfitting by shrinking weights.
Other Settings
lr_scheduler_type = "linear": Gradually reduces the learning rate.
logging_steps = 1: Reports progress every step.
output_dir = "outputs": Where to save the trained model.
This creates a complete training setup optimized for memory efficiency and stable learning.
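A sketch of the full trainer setup with those settings; newer trl versions may prefer processing_class over tokenizer, so adjust to whatever your installed version expects:

```python
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 1,  # one conversation per device per step
        gradient_accumulation_steps = 4,  # effective batch size of 4
        warmup_steps = 5,                 # ramp the learning rate up gently
        max_steps = 30,                   # short demo run
        learning_rate = 2e-4,
        optim = "adamw_8bit",             # memory-efficient optimizer
        weight_decay = 0.01,              # mild regularization
        lr_scheduler_type = "linear",     # decay the learning rate linearly
        logging_steps = 1,                # log every step
        output_dir = "outputs",           # where checkpoints land
    ),
)
```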
Running Training
The train() method executes all 30 training steps we configured earlier.
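Kicking it off is a single call:

```python
trainer.train()  # runs the 30 training steps configured above
```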
Generating Response
This code tests the trained model with a new math problem.
Creating Test Input
messages = [{"role": "user", "content": r"solve for x: \sqrt{2x + 5} = x - 1"}]
Sets up a square root equation as a conversation message.
Converting to Model Format
inputs = tokenizer.apply_chat_template(...)
Transforms the math problem into model-readable format:
add_generation_prompt = True: Signals the model to respond.
return_tensors = "pt": Returns PyTorch tensors.
reasoning_effort = "low": Minimal step-by-step reasoning.
Generating Response
_ = model.generate(**inputs, max_new_tokens = 128, streamer = TextStreamer(tokenizer))
Creates the solution:
max_new_tokens = 128: Limits response length.
streamer: Shows output in real time.
This tests whether your fine-tuned model can solve math problems better than before training.
The underscore _ ignores the return value since we only want to see the streamed output.
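For reference, the whole post-training test fits in one cell; as before, return_dict = True is my addition so the inputs unpack cleanly:

```python
from transformers import TextStreamer

messages = [{"role": "user", "content": r"solve for x: \sqrt{2x + 5} = x - 1"}]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,  # the model should produce the answer
    return_tensors = "pt",
    return_dict = True,            # assumption: enables ** unpacking below
    reasoning_effort = "low",      # keep the visible reasoning short
).to(model.device)

_ = model.generate(
    **inputs,
    max_new_tokens = 128,                # limit the response length
    streamer = TextStreamer(tokenizer),  # stream the answer token by token
)
```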
So, that's how you can instruct-tune gpt-oss
with your own custom data.