
Introduction
The encoder-decoder architecture is a fundamental neural network design that transforms input sequences into output sequences. This approach works by splitting the problem into two parts. The encoder processes input data and creates a compressed representation. The decoder then uses this representation to generate the desired output.
This architecture completely changed sequence-to-sequence tasks. Machine translation, text summarization, and question-answering systems all became more powerful. Before this breakthrough, these tasks relied on complex rule-based and statistical pipelines. Neural networks could finally handle variable-length inputs and outputs effectively.
The concept centers on a simple division of labor. Encoders "understand" the input by creating meaningful representations. Decoders "generate" output by interpreting these representations step by step. This separation allows each component to specialize in its specific function.
The 2017 paper "Attention Is All You Need" marked the breakthrough moment for this field. It introduced the Transformer architecture that relies entirely on attention mechanisms. This eliminated the need for recurrent connections that had limited earlier models.

Overview of the transformer architecture. | Source: Attention is all you need.
Real-world applications demonstrate the practical value of encoder-decoder models. Google Translate uses these principles for language translation. ChatGPT’s underlying architecture builds on similar transformer decoder concepts for generating human-like responses.
Encoder vs Decoder
What does an encoder model do?
Encoder models convert input sequences into rich, contextual representations. They "read" and "understand" input text without generating new output. The encoder processes each word while considering its relationship to all other words in the sequence.

Illustration of how the encoder-decoder model works. | Source: Attention? Attention!
BERT represents the most successful encoder-only model. It uses bidirectional processing to see both left and right context simultaneously. This allows BERT to understand words based on their complete surrounding context, not just previous words.

Overall pre-training and fine-tuning procedures for BERT. | Source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Common encoder applications include classification, named entity recognition, and sentiment analysis. These tasks require deep understanding but don’t need text generation. The encoder creates embeddings that capture semantic meaning for downstream prediction tasks.
What does a decoder model do?
Decoder models generate output sequences token by token. They use autoregressive generation to predict the next word based on previously generated words. Each prediction depends only on tokens that came before it.
GPT models exemplify decoder-only architecture. They employ masking mechanisms that prevent "looking ahead" during training. This ensures the model learns to generate text sequentially, mimicking natural language production.
Decoder applications include text generation, language modeling, and completion tasks. They excel at creative writing, code generation, and conversational AI. The autoregressive nature makes them perfect for open-ended generation tasks.
Context vector & "sequence-to-sequence" intuition
The context vector serves as the “bridge” between encoder and decoder components. This fixed-length representation captures the essential meaning of the entire input sequence. Information flows from input understanding to output generation through this compressed representation.
The seq2seq paradigm enables variable-length input and output tasks. Translation exemplifies this perfectly: "Hello world" becomes "Hola mundo" with different lengths. The context vector allows flexible mapping between sequences of any size, making modern NLP applications possible.
Historical Evolution
Let's briefly see how the encoder-decoder model evolved.
Seq2Seq with RNNs & LSTMs (2014)
The original sequence-to-sequence learning paper by Sutskever, Vinyals, and Le introduced LSTM-based encoder-decoder architecture for machine translation. Their key innovation involved using one LSTM to read input sequences and create fixed-dimensional vector representations. Another LSTM then decoded output sequences from these vectors.
The "reversing input sequences" trick proved crucial for performance improvements. By reversing source sentence word order, the model created more short-term dependencies between input and output. This simple transformation made optimization easier and dramatically improved translation quality.
However, fundamental limitations emerged. Vanishing gradients plagued longer sequences. Sequential processing prevented parallelization during training. The fixed-dimensional context vector became a bottleneck for longer inputs.
The Transformer encoder-decoder breakthrough (2017)
Vaswani et al.'s Transformer architecture eliminated recurrence entirely. Self-attention mechanisms allowed parallel processing of all sequence positions simultaneously. This breakthrough addressed the sequential computation bottleneck that limited earlier models.
Multi-head attention enabled capturing different types of relationships across various representation subspaces. Positional encoding handled sequence order without requiring RNNs. The model achieved superior performance with significantly reduced training time.
Training efficiency improved dramatically. The Transformer reached state-of-the-art results on machine translation tasks after just 12 hours on eight GPUs. This represented a massive acceleration compared to recurrent architectures.
Rise of encoder-only (BERT) and decoder-only (GPT) models
BERT's bidirectional encoder revolutionized understanding tasks through masked language modeling. Unlike previous unidirectional approaches, BERT could see context from both directions simultaneously. This bidirectional processing proved superior for classification and comprehension tasks.
GPT's decoder-only approach focused on generation capabilities. These models demonstrated emergent properties through next-word prediction training. Task-specific architectures gave way to unified approaches that could handle multiple applications.
Currently, decoder-only models dominate the large language model space. Models like the GPT series, ChatGPT, and OpenAI's o-series have shown remarkable versatility in text generation tasks. However, encoder-only models remain valuable for embedding-based applications.
Inside the Transformer Encoder-Decoder
Encoder Stack
The Transformer encoder begins with token embeddings combined with positional encoding. These embeddings pass through multi-head self-attention mechanisms that capture relationships between all positions. Feed-forward networks process each position independently using two linear transformations with ReLU activation.
Residual connections and layer normalization surround each sub-layer. The formula LayerNorm(x + Sublayer(x)) ensures stable training and gradient flow. The original implementation uses N=6 identical layers, with all components outputting dimension d_model=512.
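For reference, the position-wise feed-forward network from the original paper applies the same two linear transformations, with a ReLU in between, to every position:

```latex
\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2,
\qquad W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}},\;
W_2 \in \mathbb{R}^{d_{ff} \times d_{\text{model}}},\; d_{ff} = 2048
```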
Decoder Stack
The decoder stack mirrors the encoder but adds crucial modifications. Masked self-attention prevents future token access during training. This masking sets attention weights to negative infinity for illegal connections. Encoder-decoder attention (cross-attention) allows decoder positions to attend over all encoder outputs.
The decoder includes the same feed-forward networks and normalization as the encoder. Autoregressive generation ensures each position depends only on previous outputs. This design enables sequential text generation while maintaining training efficiency.
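To make the masking idea concrete, here is a small illustrative snippet of my own (not from the paper's code) showing how "future" positions can be blocked by setting their scores to negative infinity before the softmax:

```python
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)  # raw attention scores for a tiny sequence

# Upper-triangular mask: True marks future positions each token must not see
future_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

masked_scores = scores.masked_fill(future_mask, float('-inf'))
weights = torch.softmax(masked_scores, dim=-1)  # future positions receive weight 0
print(weights)
```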
Why Do We Divide by √d_k in Scaled Dot-Product Attention?
The scaling factor prevents softmax saturation when dot products become large. Without scaling, high-dimensional vectors produce extreme values that push softmax into regions with tiny gradients.
The mathematical explanation involves variance. When query and key components are independent random variables with variance 1, their dot product has variance d_k. Dividing by √d_k normalizes this variance to 1, maintaining stable gradient flow during training.
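A quick numerical check makes this concrete (an illustrative snippet of my own, not from the paper):

```python
import torch

d_k = 512
q = torch.randn(100_000, d_k)  # unit-variance query components
k = torch.randn(100_000, d_k)  # unit-variance key components
dots = (q * k).sum(dim=-1)     # one dot product per row

print(dots.var().item())                   # ≈ d_k (about 512)
print((dots / d_k ** 0.5).var().item())    # ≈ 1 after scaling by √d_k
```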
Common Variants: Pre-Norm vs Post-Norm, Sparse attention
Modern architectures experiment with pre-norm placement, applying layer normalization before rather than after sub-layers. This improves training stability in deeper models. Sparse attention patterns reduce computational complexity by limiting attention to specific positions.
Recent innovations include local attention windows and learned sparsity patterns. These modifications maintain model quality while dramatically reducing memory requirements for long sequences.
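As a rough sketch of the pre-norm vs post-norm distinction (my own example, where sublayer stands in for attention or the feed-forward block), the two placements differ only in where the normalization sits:

```python
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Original Transformer style: LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer, self.norm = sublayer, nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    """Pre-norm style, common in deeper modern stacks: x + Sublayer(LayerNorm(x))."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer, self.norm = sublayer, nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```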
Building an Encoder-Decoder Transformer in PyTorch
Now, let’s implement an encoder-decoder architecture using PyTorch. I won't reproduce the entire code base here; instead, I'll focus on the core concepts and skip the data-processing functions entirely. You can find the full code in this Colab Notebook.
Let’s start by implementing the attention mechanism, i.e., the scaled dot product attention.
The ScaledDotProductAttention class calculates how much each part of the input should “pay attention” to every other part.
Here's what happens step by step:
- Query-Key matching: The model multiplies queries (Q) with keys (K) to find relationships between different positions in the sequence.
- Scaling: It divides by the square root of d_k (key dimension) to prevent extremely large values that could destabilize training.
- Masking: The attn_mask hides certain positions (like future tokens in language modeling) by setting their scores to negative infinity.
- Attention weights: Softmax converts the scores into probabilities that sum to 1, showing how much attention each position gets.
- Final output: These attention weights are applied to the values (V) to create a weighted combination, producing the final context representation.
The function returns both the attention-weighted context and the attention weights themselves, allowing the model to focus on relevant information while maintaining interpretability.
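Here is a minimal sketch of what such a class might look like. The class name and the attn_mask argument follow the description above; the exact tensor shapes and the d_k constructor argument are my assumptions, and the full version lives in the Colab Notebook.

```python
import torch
import torch.nn as nn

class ScaledDotProductAttention(nn.Module):
    """Computes softmax(Q·Kᵀ / √d_k)·V, optionally masking out forbidden positions."""
    def __init__(self, d_k):
        super().__init__()
        self.d_k = d_k

    def forward(self, Q, K, V, attn_mask=None):
        # Query-Key matching: score matrix of shape [batch, heads, len_q, len_k]
        scores = torch.matmul(Q, K.transpose(-1, -2)) / self.d_k ** 0.5
        if attn_mask is not None:
            # Masking: positions marked True get -inf, so softmax gives them ~zero weight
            scores = scores.masked_fill(attn_mask, float('-inf'))
        attn = torch.softmax(scores, dim=-1)   # attention weights sum to 1 over the keys
        context = torch.matmul(attn, V)        # weighted combination of the values
        return context, attn
```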
The MultiHeadAttention class splits the attention process into several parallel streams. Each head focuses on different patterns.
Key components:
- Linear layers (W_Q, W_K, W_V): Transform the input into queries, keys, and values for all heads at once
- Reshaping: Splits the data into separate heads using .view() and .transpose()
- Parallel processing: Each head runs scaled dot-product attention independently
- Combining results: Concatenates all head outputs and passes them through a final linear layer
- Residual connection: Adds the original input back to the output
- Layer normalization: Stabilizes training by normalizing the final result
The model processes multiple attention patterns simultaneously. This helps it capture different types of relationships between words or tokens in the sequence.
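A minimal sketch, assuming the ScaledDotProductAttention class above and the original paper's sizes (d_model=512, 8 heads, d_k=d_v=64); the implementation in the Colab Notebook may differ in details:

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Runs several scaled dot-product attention heads in parallel, then recombines them."""
    def __init__(self, d_model=512, d_k=64, d_v=64, n_heads=8):
        super().__init__()
        self.d_k, self.d_v, self.n_heads = d_k, d_v, n_heads
        # Linear layers transform the input into queries, keys, and values for all heads at once
        self.W_Q = nn.Linear(d_model, d_k * n_heads)
        self.W_K = nn.Linear(d_model, d_k * n_heads)
        self.W_V = nn.Linear(d_model, d_v * n_heads)
        self.fc = nn.Linear(n_heads * d_v, d_model)
        self.attention = ScaledDotProductAttention(d_k)
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, Q, K, V, attn_mask=None):
        residual, batch_size = Q, Q.size(0)
        # Reshaping: split the projections into separate heads -> [batch, heads, seq_len, d_k]
        q_s = self.W_Q(Q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        k_s = self.W_K(K).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        v_s = self.W_V(V).view(batch_size, -1, self.n_heads, self.d_v).transpose(1, 2)
        if attn_mask is not None:
            attn_mask = attn_mask.unsqueeze(1).repeat(1, self.n_heads, 1, 1)  # broadcast over heads
        # Parallel processing: every head runs scaled dot-product attention independently
        context, attn = self.attention(q_s, k_s, v_s, attn_mask)
        # Combining results: concatenate the heads and project back to d_model
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.d_v)
        output = self.fc(context)
        # Residual connection + layer normalization
        return self.layer_norm(output + residual), attn
```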
The PoswiseFeedForwardNet class applies the same neural network to every word position separately.
Key components:
- First convolution (conv1): Expands each position from d_model dimensions to d_ff dimensions (usually larger).
- ReLU activation: Adds non-linearity by zeroing out negative values.
- Second convolution (conv2): Shrinks back down to the original d_model size.
- Transpose operations: Flips dimensions so convolutions work properly on the sequence.
- Residual connection: Adds the original input back to prevent information loss.
- Layer normalization: Keeps the outputs stable for better training.
This network gives each position a chance to transform its representation through a two-layer neural network. The kernel size of 1 means it only looks at one position at a time, not neighboring positions.
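A possible sketch of the class, assuming d_model=512 and d_ff=2048 as in the original paper:

```python
import torch.nn as nn

class PoswiseFeedForwardNet(nn.Module):
    """Applies the same two-layer network to every position via kernel-size-1 convolutions."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.conv1 = nn.Conv1d(in_channels=d_model, out_channels=d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(in_channels=d_ff, out_channels=d_model, kernel_size=1)
        self.relu = nn.ReLU()
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, inputs):                      # inputs: [batch, seq_len, d_model]
        residual = inputs
        # Transpose so Conv1d sees d_model as the channel dimension
        output = self.relu(self.conv1(inputs.transpose(1, 2)))  # expand to d_ff
        output = self.conv2(output).transpose(1, 2)             # shrink back to d_model
        # Residual connection + layer normalization
        return self.layer_norm(output + residual)
```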
The EncoderLayer class stacks two main components to process input sequences. Think of it as a two-step filter that first finds relationships between words, then refines each word individually.
Key components:
- Self-attention (enc_self_attn): Uses the same input for queries, keys, and values, letting each word attend to all other words in the sequence
- Feed-forward network (pos_ffn): Processes each position independently after attention
The layer first runs multi-head attention where each word looks at every other word to understand context. Then it passes the results through the position-wise feed-forward network to refine the representations.
Self-attention pattern:
Notice that enc_inputs appears three times in the attention call. This means each word serves as query, key, and value simultaneously. This lets the model discover which words are most relevant to each other.
The output contains both the processed representations and the attention weights, showing what the model focused on during processing.
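A compact sketch of how the layer could be wired together, reusing the MultiHeadAttention and PoswiseFeedForwardNet sketches above:

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Self-attention followed by a position-wise feed-forward network."""
    def __init__(self):
        super().__init__()
        self.enc_self_attn = MultiHeadAttention()
        self.pos_ffn = PoswiseFeedForwardNet()

    def forward(self, enc_inputs, enc_self_attn_mask=None):
        # Self-attention: enc_inputs serves as query, key, and value at the same time
        enc_outputs, attn = self.enc_self_attn(enc_inputs, enc_inputs, enc_inputs, enc_self_attn_mask)
        # Refine each position independently
        enc_outputs = self.pos_ffn(enc_outputs)
        return enc_outputs, attn
```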
The DecoderLayer class combines three components to generate text while looking at both previous outputs and encoder information. It's like having three different ways to gather information before making a decision.
Key components:
- Self-attention (dec_self_attn): Lets each output position look at previous output positions only.
- Cross-attention (dec_enc_attn): Connects decoder outputs to encoder inputs for context.
- Feed-forward network (pos_ffn): Refines each position independently.
First, the decoder examines what it has generated so far using self-attention. Then it looks at the encoder's understanding of the input through cross-attention. Finally, it processes each position through the feed-forward network.
The second attention uses dec_outputs as queries but enc_outputs as both keys and values. This lets the decoder ask "what from the input is relevant to what I'm generating now?"
The layer returns the final processed outputs along with two attention maps, showing where the model focused during self-attention and cross-attention.
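A sketch of the layer, again building on the MultiHeadAttention and PoswiseFeedForwardNet sketches above:

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Masked self-attention, cross-attention over the encoder, then a feed-forward network."""
    def __init__(self):
        super().__init__()
        self.dec_self_attn = MultiHeadAttention()
        self.dec_enc_attn = MultiHeadAttention()
        self.pos_ffn = PoswiseFeedForwardNet()

    def forward(self, dec_inputs, enc_outputs, dec_self_attn_mask=None, dec_enc_attn_mask=None):
        # Self-attention over what has been generated so far (future tokens are masked)
        dec_outputs, dec_self_attn = self.dec_self_attn(dec_inputs, dec_inputs, dec_inputs, dec_self_attn_mask)
        # Cross-attention: decoder states as queries, encoder outputs as keys and values
        dec_outputs, dec_enc_attn = self.dec_enc_attn(dec_outputs, enc_outputs, enc_outputs, dec_enc_attn_mask)
        # Position-wise refinement
        dec_outputs = self.pos_ffn(dec_outputs)
        return dec_outputs, dec_self_attn, dec_enc_attn
```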
The Encoder class stacks multiple encoder layers to deeply understand input text. It combines word meanings with position information, then processes the result through several attention layers.
Key components:
- Word embeddings (src_emb): Converts input tokens into vector representations.
- Position embeddings (pos_emb): Adds location information using sinusoidal patterns.
- Layer stack (layers): Contains multiple encoder layers for deep processing.
First, it combines word and position embeddings to create initial representations. Then it creates a mask to ignore padding tokens during attention.
Each encoder layer processes the sequence sequentially. The output from one layer becomes the input for the next layer.
Attention tracking:
The encoder collects attention weights from every layer in enc_self_attns. This lets you see how the model's focus changes as it processes deeper into the network.
The final output contains both the processed representations and all attention patterns from each layer.
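The sketch below illustrates one way to assemble this. The sinusoid_encoding_table and get_attn_pad_mask helpers, the pad_idx=0 convention, and the max_len default are my own hypothetical additions, not necessarily what the Colab Notebook uses:

```python
import numpy as np
import torch
import torch.nn as nn

def sinusoid_encoding_table(max_len, d_model):
    """Fixed sinusoidal position encodings, as in the original paper."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    table = np.zeros((max_len, d_model))
    table[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions use sine
    table[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions use cosine
    return torch.FloatTensor(table)

def get_attn_pad_mask(seq_q, seq_k, pad_idx=0):
    """True wherever the key token is padding, broadcast over every query position."""
    batch_size, len_q = seq_q.size()
    len_k = seq_k.size(1)
    return seq_k.eq(pad_idx).unsqueeze(1).expand(batch_size, len_q, len_k)

class Encoder(nn.Module):
    def __init__(self, src_vocab_size, max_len=512, d_model=512, n_layers=6):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab_size, d_model)            # word embeddings
        self.pos_emb = nn.Embedding.from_pretrained(                    # sinusoidal position embeddings
            sinusoid_encoding_table(max_len, d_model), freeze=True)
        self.layers = nn.ModuleList([EncoderLayer() for _ in range(n_layers)])

    def forward(self, enc_inputs):                                      # enc_inputs: [batch, src_len] token ids
        positions = torch.arange(enc_inputs.size(1), device=enc_inputs.device).unsqueeze(0)
        enc_outputs = self.src_emb(enc_inputs) + self.pos_emb(positions)  # word + position information
        enc_self_attn_mask = get_attn_pad_mask(enc_inputs, enc_inputs)    # ignore padding tokens
        enc_self_attns = []
        for layer in self.layers:                                       # each layer feeds the next
            enc_outputs, attn = layer(enc_outputs, enc_self_attn_mask)
            enc_self_attns.append(attn)                                 # track attention per layer
        return enc_outputs, enc_self_attns
```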
The Decoder class creates text by looking at what it has generated so far and at the encoder's understanding of the input. It prevents cheating by blocking access to future tokens.
Key components:
- Target embeddings (tgt_emb): Converts output tokens into vector representations.
- Position embeddings (pos_emb): Adds location information for output positions.
- Layer stack (layers): Multiple decoder layers for deep generation processing.
The masking strategy:
The decoder uses two types of masks for self-attention:
- Padding mask: Ignores meaningless padding tokens.
- Subsequent mask: Blocks future tokens to prevent looking ahead.
These masks are combined so that, during training, each position attends only to real (non-padding) tokens that come before it.
Cross-attention setup:
The dec_enc_attn_mask handles attention between decoder outputs and encoder inputs, masking out padding tokens so the decoder does not attend to them.
Layer processing:
Each decoder layer processes the sequence while tracking both self-attention and cross-attention patterns. The decoder collects all attention weights to show how the model focuses during generation.
The output includes processed representations and attention maps from every layer.
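A possible sketch, reusing the sinusoid_encoding_table and get_attn_pad_mask helpers from the Encoder sketch and adding a hypothetical get_attn_subsequent_mask helper for the look-ahead mask:

```python
import torch
import torch.nn as nn

def get_attn_subsequent_mask(seq):
    """True above the diagonal: blocks every position from attending to future tokens."""
    batch_size, seq_len = seq.size()
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=seq.device), diagonal=1)
    return mask.unsqueeze(0).expand(batch_size, -1, -1)

class Decoder(nn.Module):
    def __init__(self, tgt_vocab_size, max_len=512, d_model=512, n_layers=6):
        super().__init__()
        self.tgt_emb = nn.Embedding(tgt_vocab_size, d_model)            # target embeddings
        self.pos_emb = nn.Embedding.from_pretrained(                    # position embeddings
            sinusoid_encoding_table(max_len, d_model), freeze=True)
        self.layers = nn.ModuleList([DecoderLayer() for _ in range(n_layers)])

    def forward(self, dec_inputs, enc_inputs, enc_outputs):
        positions = torch.arange(dec_inputs.size(1), device=dec_inputs.device).unsqueeze(0)
        dec_outputs = self.tgt_emb(dec_inputs) + self.pos_emb(positions)
        # Self-attention mask: padding mask combined with the subsequent (look-ahead) mask
        dec_self_attn_mask = get_attn_pad_mask(dec_inputs, dec_inputs) | get_attn_subsequent_mask(dec_inputs)
        # Cross-attention mask: built from decoder and encoder inputs to hide padding tokens
        dec_enc_attn_mask = get_attn_pad_mask(dec_inputs, enc_inputs)
        dec_self_attns, dec_enc_attns = [], []
        for layer in self.layers:
            dec_outputs, self_attn, enc_attn = layer(
                dec_outputs, enc_outputs, dec_self_attn_mask, dec_enc_attn_mask)
            dec_self_attns.append(self_attn)                            # track both attention types
            dec_enc_attns.append(enc_attn)
        return dec_outputs, dec_self_attns, dec_enc_attns
```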
Lastly, we put the encoder and decoder together inside the full Transformer model.
The Transformer class connects all the pieces to transform input sequences into output sequences.
Key components:
- Encoder: Processes and understands the input sequence.
- Decoder: Generates the output sequence step by step.
- Projection layer: Converts decoder outputs into vocabulary predictions.
First, the encoder creates rich representations of the input. Then the decoder generates output while attending to both previous outputs and encoder information.
Finally, the projection layer transforms the decoder's hidden states into probability scores for each word in the vocabulary.
Output format:
The model returns flattened logits for easy loss calculation during training. It also provides all attention weights from both encoder and decoder layers.
Attention tracking:
You get three types of attention patterns: how the encoder focuses on input, how the decoder focuses on its own output, and how the decoder attends to the encoder's representations.
This complete architecture enables sequence-to-sequence tasks like translation, summarization, and text generation.
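A final sketch of how the pieces could be tied together, building on the Encoder and Decoder sketches above; the constructor arguments are my assumptions:

```python
import torch.nn as nn

class Transformer(nn.Module):
    """Encoder + decoder + a projection layer onto the target vocabulary."""
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512):
        super().__init__()
        self.encoder = Encoder(src_vocab_size)
        self.decoder = Decoder(tgt_vocab_size)
        self.projection = nn.Linear(d_model, tgt_vocab_size, bias=False)

    def forward(self, enc_inputs, dec_inputs):
        # Encode the source sequence into contextual representations
        enc_outputs, enc_self_attns = self.encoder(enc_inputs)
        # Decode while attending to previous outputs and to the encoder's representations
        dec_outputs, dec_self_attns, dec_enc_attns = self.decoder(dec_inputs, enc_inputs, enc_outputs)
        # Convert hidden states into vocabulary logits, flattened for the loss function
        dec_logits = self.projection(dec_outputs)
        return dec_logits.reshape(-1, dec_logits.size(-1)), enc_self_attns, dec_self_attns, dec_enc_attns
```

During training, the flattened logits pair directly with flattened target token ids in nn.CrossEntropyLoss, which is exactly why the model returns them in that shape.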
Please check the complete code in this Colab Notebook. It will help you better understand the encoder-decoder architecture and how it works.