# Implementing GPT-2 Using Hugging Face and PyTorch

Canonical URL: https://www.adaline.ai/blog/implementing-gpt-2-using-huggingface-and-pytorch
LLM text URL: https://www.adaline.ai/blog/implementing-gpt-2-using-huggingface-and-pytorch/llms.txt
Published: 2025-08-01T00:00:00.000Z
Modified: 2025-08-01T20:17:50.451Z
Author: Nilesh Barla
Category: Tips
Visibility: public
Reading time: 15 min
Topics: Tips, Adaline, AI agent observability, agent evals, self-improving agents

## Summary

Step-by-step tutorial to build GPT-2 in a weekend for PMs & Engineers

## Article

# Introduction

[GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) marked a turning point in language models. Released in early 2019, it showed that a single transformer-based model could generate coherent paragraphs, answer questions, and even write simple stories. At the time, this felt almost magical. Today, we take large language models (LLMs) for granted. Yet, the lessons from GPT-2 still matter.

Why start with GPT-2?

First, it’s lean.

Unlike modern LLMs that span hundreds of billions of parameters, the smallest GPT-2 checkpoint has roughly 124 million parameters and runs comfortably on a single GPU. That makes it ideal for hands-on learning.

Second, its code and weights are openly available. You can inspect every line, every weight, and every step of fine-tuning. That transparency is rare in AI today.

Third, the core transformer architecture in GPT-2 remains at the heart of models like GPT-4, Llama 2, and beyond. By studying GPT-2’s attention mechanisms, tokenization strategies, and training pipelines, you build the foundation needed to tackle bigger, more complex models.

In this tutorial, we’ll walk through setting up GPT-2 with PyTorch and Hugging Face’s Transformers library. You’ll see how to prepare datasets, fine-tune the model, and generate text. Along the way, you’ll learn best practices that apply to any LLM project—no matter the model size.

# Environment Setup & Device Configuration for GPT-2 Training

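Before you write a single line of model code, let’s prepare your workspace. First, install PyTorch, Hugging Face’s Transformers, the 🤗 Datasets library used later to load WikiText, and `accelerate`, which recent versions of `Trainer` depend on. The leading `!` is for notebook cells; drop it in a regular shell:

```shellscript Installing Dependencies.
! pip install torch transformers datasets accelerate
```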

With dependencies in place, import the core libraries:

```python Importing Libraries.
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import (
   GPT2LMHeadModel,
   GPT2Tokenizer,
   GPT2Config,
   TrainingArguments,
   Trainer,
   DataCollatorForLanguageModeling
)

```

Here’s what each import does:

- `torch`: The engine for tensors and GPU acceleration.
- `Dataset` & `DataLoader`: Wrap your text data in a class, then batch and shuffle it.
- `GPT2Config`: Define or tweak GPT-2’s architecture without loading weights.
- `GPT2Tokenizer`: Convert raw text into token IDs the model understands.
- `GPT2LMHeadModel`: The pretrained language model, ready for fine-tuning or inference.
- `TrainingArguments` & `Trainer`: High-level APIs to manage training loops, checkpoints, and logging.
- `DataCollatorForLanguageModeling`: Assembles batches and builds the `labels` tensor: random masking for masked-LM training, or a copy of the inputs when `mlm=False` (causal LM, as GPT-2 uses).
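To make the `GPT2Config` entry concrete, here is a small sketch that reproduces the parameter count of the smallest GPT-2 checkpoint by hand. The four hyperparameters are the published values for `gpt2`; the helper name `gpt2_parameter_count` is made up for this example, and the count assumes the output head shares its weights with the token embedding.

```python
# Published hyperparameters of the smallest GPT-2 checkpoint ("gpt2").
vocab_size = 50257   # BPE vocabulary size
n_positions = 1024   # maximum context length
n_embd = 768         # hidden size
n_layer = 12         # number of transformer blocks


def gpt2_parameter_count(vocab_size, n_positions, n_embd, n_layer):
    """Count weights and biases; the LM head reuses the token-embedding matrix."""
    token_emb = vocab_size * n_embd
    position_emb = n_positions * n_embd

    layer_norm = 2 * n_embd                          # scale + bias
    attn_qkv = n_embd * (3 * n_embd) + 3 * n_embd    # fused query/key/value projection
    attn_out = n_embd * n_embd + n_embd              # attention output projection
    mlp_in = n_embd * (4 * n_embd) + 4 * n_embd      # feed-forward expansion
    mlp_out = (4 * n_embd) * n_embd + n_embd         # feed-forward contraction
    per_block = 2 * layer_norm + attn_qkv + attn_out + mlp_in + mlp_out

    final_layer_norm = 2 * n_embd
    return token_emb + position_emb + n_layer * per_block + final_layer_norm


total = gpt2_parameter_count(vocab_size, n_positions, n_embd, n_layer)
print(f"{total:,} parameters")  # 124,439,808 parameters
```

Changing `n_layer` or `n_embd` in a `GPT2Config` moves this number accordingly, which is a quick way to estimate whether a custom architecture will fit on your hardware.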

Next, choose your device:

```python Device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```

This line lets PyTorch detect CUDA GPUs—if available—and fall back to CPU otherwise. Move your model and batches onto `device` to harness hardware acceleration. That’s it. Your environment is now configured to train and run GPT-2.
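As a minimal sketch of what "moving onto `device`" looks like in practice, the snippet below uses a single linear layer as a stand-in for a real model. The names `toy_model` and `batch` are placeholders for this example only.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for a real model: a single linear layer.
toy_model = nn.Linear(8, 2).to(device)

# Inputs must live on the same device as the model's weights.
batch = torch.randn(4, 8).to(device)

with torch.no_grad():
    output = toy_model(batch)

print(output.shape, output.device)
```

If the model and its inputs sit on different devices, PyTorch raises a runtime error, so it pays to route every tensor through the same `device` variable.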

# Preparing Custom Text Datasets with Hugging Face Tokenizer

First, import `load_dataset` from 🤗 Datasets; we’ll use it shortly to pull in the raw text.

```python Datasets.
from datasets import load_dataset
import os

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
```

The device check is repeated here so the cell runs on its own; it picks the GPU when available and the CPU otherwise. Next, wrap the texts in a PyTorch-friendly class:

```python TextDataset.
class TextDataset(Dataset):
    """Tokenizes each text on access and returns fixed-length tensors."""

    def __init__(self, texts, tokenizer, max_length=512):
        self.tokenizer = tokenizer
        self.texts = texts
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        # Truncate long texts and pad short ones to exactly max_length tokens
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt"
        )
        # The tokenizer returns shape (1, max_length); flatten to (max_length,)
        return {
            "input_ids": encoding["input_ids"].flatten(),
            "attention_mask": encoding["attention_mask"].flatten()
        }

```

Here, every text string is converted into a fixed-length sequence of token IDs plus a matching attention mask. Truncation cuts texts longer than `max_length`; padding fills shorter ones so every example in a batch has the same shape.
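To see truncation and padding without downloading a tokenizer, here is a rough pure-Python imitation. The helper `pad_or_truncate`, the token IDs, and `pad_id=0` are all made up for illustration; the real tokenizer does the equivalent work internally.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Hypothetical helper that mimics truncation plus max_length padding."""
    kept = token_ids[:max_length]                      # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad                # right-padding
    attention_mask = [1] * len(kept) + [0] * n_pad     # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short = pad_or_truncate([11, 12, 13], max_length=5, pad_id=0)
long = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], max_length=5, pad_id=0)

print(short)  # {'input_ids': [11, 12, 13, 0, 0], 'attention_mask': [1, 1, 1, 0, 0]}
print(long)   # {'input_ids': [11, 12, 13, 14, 15], 'attention_mask': [1, 1, 1, 1, 1]}
```

The attention mask is what lets the model ignore the padded positions, so both outputs can share a batch even though the original texts had different lengths.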

Finally, load and split your dataset:

```python Prepare Dataset.
def prepare_dataset(dataset_name="wikitext", subset="wikitext-2-raw-v1", tokenizer=None, max_length=512):
   dataset = load_dataset(dataset_name, subset)
   train_texts = [item['text'] for item in dataset['train'] if item['text'].strip()]
   val_texts   = [item['text'] for item in dataset['validation'] if item['text'].strip()]
   train_dataset = TextDataset(train_texts, tokenizer, max_length)
   val_dataset   = TextDataset(val_texts,   tokenizer, max_length)
   return train_dataset, val_dataset

```

This function pulls in WikiText, filters out empty lines, and returns two ready-to-use datasets. You’re set to batch, train, and experiment—no manual token slicing required.
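The `Trainer` used later builds its own data loaders, but it can help to see what batching looks like on its own. The sketch below swaps in a hypothetical `ToyTokenDataset` that returns random token IDs in the same dictionary format as `TextDataset`, so it runs without any downloads.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyTokenDataset(Dataset):
    """Hypothetical dataset returning fixed-length dummy token IDs."""

    def __init__(self, num_examples=10, seq_len=8):
        self.input_ids = torch.randint(0, 100, (num_examples, seq_len))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        ids = self.input_ids[idx]
        return {"input_ids": ids, "attention_mask": torch.ones_like(ids)}


loader = DataLoader(ToyTokenDataset(), batch_size=4, shuffle=True)
first_batch = next(iter(loader))

print(first_batch["input_ids"].shape)  # torch.Size([4, 8])
print(len(loader))                     # 3 batches: 4 + 4 + 2 examples
```

Because every example has the same length, the default collate function can stack them into one tensor per key, which is exactly why `TextDataset` pads to `max_length`.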

# Initializing GPT-2 Model & Tokenizer: Best Practices

Before you fine-tune GPT-2, it’s essential to initialize the model and tokenizer correctly. These components must stay aligned, especially when special tokens like padding are involved. GPT-2’s tokenizer ships without a `pad_token`, so we map it to the `eos_token`; without that, asking the tokenizer to pad raises an error.

We wrap all this setup in a function:

```python Setup Model and Tokenizer.
def setup_model_and_tokenizer(model_name="gpt2"):
    """Initialize GPT-2 model and tokenizer"""
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)

    # Add padding token if missing
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer
```
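Mapping padding to EOS also affects how the loss is computed. As a rough sketch of the idea (not the library's exact code): for causal language modeling the labels start as a copy of `input_ids`, padded positions are set to `-100`, and each position's prediction is compared with the next token. The token IDs, `pad_id`, and tiny vocabulary below are made up, and random logits stand in for real model output.

```python
import torch
import torch.nn.functional as F

pad_id = 9          # hypothetical ID standing in for GPT-2's EOS/pad token
vocab_size = 10     # tiny vocabulary for illustration

# Two sequences; the second is right-padded.
input_ids = torch.tensor([[1, 2, 3, 4],
                          [5, 6, pad_id, pad_id]])

# Causal LM labels start as a copy of the inputs, with padding set to -100.
labels = input_ids.clone()
labels[labels == pad_id] = -100

# Random logits stand in for model output: (batch, seq_len, vocab).
torch.manual_seed(0)
logits = torch.randn(2, 4, vocab_size)

# Position t predicts token t + 1, so drop the last logit and the first label.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = labels[:, 1:].reshape(-1)

# ignore_index=-100 removes padded positions from the average.
loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
counted = int((shift_labels != -100).sum())

print(labels.tolist())  # [[1, 2, 3, 4], [5, 6, -100, -100]]
print(counted)          # 4 positions contribute to the loss
```

One side effect worth knowing: since padding and EOS share an ID, genuine end-of-text tokens in your data are masked out of the loss as well.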

Next, we define the full training routine using Hugging Face’s `Trainer`:

```python Train GPT2.
def train_gpt2(
    model_name="gpt2",
    dataset_name="wikitext",
    subset="wikitext-2-raw-v1",
    output_dir="./gpt2-finetuned",
    num_epochs=3,
    batch_size=4,
    learning_rate=5e-5,
    max_length=512,
    save_steps=500,
    logging_steps=100
):
    print("Loading model and tokenizer...")
    model, tokenizer = setup_model_and_tokenizer(model_name)

    print("Preparing datasets...")
    train_dataset, val_dataset = prepare_dataset(
        dataset_name, subset, tokenizer, max_length
    )

    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False  # GPT-2 uses causal language modeling
    )

    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        num_train_epochs=num_epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        learning_rate=learning_rate,
        warmup_steps=100,
        logging_steps=logging_steps,
        save_steps=save_steps,
        eval_steps=save_steps,
        eval_strategy="steps",  # named `evaluation_strategy` in older transformers releases
        save_total_limit=2,
        prediction_loss_only=True,
        remove_unused_columns=False,
        dataloader_pin_memory=False,
        gradient_accumulation_steps=2,
        fp16=torch.cuda.is_available(),
        report_to=[]
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
    )

    print("Starting training...")
    trainer.train()

    print("Saving model...")
    trainer.save_model()
    tokenizer.save_pretrained(output_dir)

    return model, tokenizer
```

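This training pipeline evaluates on the validation split and writes a checkpoint every `save_steps` optimizer updates, keeps only the two most recent checkpoints (`save_total_limit=2`), and enables mixed-precision (`fp16`) training when a CUDA GPU is present. With `gradient_accumulation_steps=2`, each optimizer update sees twice the per-device batch size. The function is compact enough to extend, which makes it a reasonable starting point for your own experiments.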
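A quick back-of-the-envelope calculation shows how these settings interact. The numbers below are a sketch: `num_examples` is a made-up dataset size, a single device is assumed, and the `Trainer` handles rounding at epoch boundaries slightly differently, so treat the results as approximate.

```python
import math

per_device_batch_size = 4        # default batch_size in train_gpt2
gradient_accumulation_steps = 2  # from the TrainingArguments above
num_devices = 1                  # assumption: a single GPU or CPU
num_examples = 20_000            # hypothetical count of non-empty training lines
num_epochs = 3

effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
updates_per_epoch = math.ceil(num_examples / effective_batch)
total_updates = updates_per_epoch * num_epochs

print(effective_batch)    # 8
print(updates_per_epoch)  # 2500
print(total_updates)      # 7500
```

With `save_steps=500`, a run of this size would evaluate and checkpoint roughly 15 times, and the 100 warmup steps would cover a little over 1% of training.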

# Fine-Tuning GPT-2 on WikiText

With your model, tokenizer, dataset, and training loop ready, it’s time to kick off fine-tuning. GPT-2 was originally trained on a large, diverse corpus, but to make it useful for a specific domain—or just to understand how training dynamics work—you’ll often want to continue training on a smaller, curated dataset.

In this example, we use the WikiText-2 corpus, a clean subset of Wikipedia articles commonly used for language modeling tasks. Fine-tuning on WikiText helps the model adapt its predictions to more formal, structured writing while demonstrating how loss, performance, and generalization evolve.

The training begins with a simple call:

```python Training.
print("Training GPT-2 on WikiText dataset...")
model, tokenizer = train_gpt2(
   model_name="gpt2",
   num_epochs=1,           # Reduced for quick testing
   batch_size=2,           # Keeps memory usage low
   output_dir="./gpt2-wikitext"
)

```

Here, we lower both the epoch count and the batch size so the script runs quickly and fits in memory on limited hardware. The `output_dir` parameter tells the script where to save checkpoints, the final model, and the tokenizer.

Once training completes, you’ll have a fine-tuned version of GPT-2 saved locally—ready for evaluation, text generation, or integration into a downstream product.
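A common way to read the validation loss is to convert it to perplexity, which is simply the exponential of the average cross-entropy per token. The loss value below is hypothetical; substitute the evaluation loss from your own training logs.

```python
import math

# Hypothetical average validation loss (cross-entropy in nats per token).
eval_loss = 3.0

perplexity = math.exp(eval_loss)
print(f"perplexity = {perplexity:.2f}")  # perplexity = 20.09
```

Lower is better: a perplexity of about 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 tokens at each step.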

Image: https://a-us.storyblok.com/f/1023026/2472x1426/07bf281652/screenshot-2025-08-01-at-11-47-48-pm.png

# Generating Text with Your Custom GPT-2 Model

Once your GPT-2 model is fine-tuned, the next step is putting it to use—by generating text from a custom prompt. This is where language models become interactive, allowing you to test how well your model has learned the patterns in your dataset.

Here's a simple function that does exactly that:

```python Generate Text.
def generate_text(model, tokenizer, prompt, max_length=100, temperature=0.8):
    model.eval()
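    # Keep the prompt on the same device as the model (Trainer may have moved it to the GPU)
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)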
    
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=max_length,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text
```

A few important points:

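- `temperature` controls randomness. Lower values produce more focused, predictable text; higher values produce more varied output.
- `do_sample=True` samples from the predicted distribution instead of always taking the most likely token, which makes output less repetitive.
- `pad_token_id` tells `generate` which ID to use for padding. GPT-2 defines none, so passing the EOS ID avoids a warning and gives sequences that finish early something to be padded with.
- `max_length` is the total length in tokens, prompt included, not the number of newly generated tokens.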
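To build intuition for `temperature`, the sketch below applies it to three made-up logits. Dividing logits by the temperature before the softmax is the standard formulation; the helper name `softmax_with_temperature` and the scores are invented for this example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

sharp = softmax_with_temperature(logits, temperature=0.5)
neutral = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)

for name, probs in [("T=0.5", sharp), ("T=1.0", neutral), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in probs])
```

At a low temperature the top token takes most of the probability mass; at a high temperature the distribution flattens, so sampling picks less likely tokens more often.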

To run the generator:

```python Inference.
prompt = "The future of artificial intelligence"
generated = generate_text(model, tokenizer, prompt)
print(f"\nGenerated text:\n{generated}")
```

This simple block lets you test different prompts and explore how your fine-tuned model responds to real-world queries.

# Conclusion

GPT-2 may be old, but it’s far from irrelevant. Its simplicity makes it the perfect playground for learning how LLMs work under the hood. In this guide, you fine-tuned it end to end, from raw text to generation. What you’ve learned here applies to larger models too. If you’re a product builder or AI engineer, this isn’t just a tutorial. It’s your blueprint for understanding and shipping smarter AI-powered features.