From Attention to Innovation: Understanding Transformers in Machine Learning 🚀
Have you ever wondered how AI models like ChatGPT, Bard, or Claude can generate text that feels human? Behind these marvels lies a revolutionary architecture called the Transformer — the engine powering today’s large language models (LLMs). If you’ve dabbled in Python and machine learning but feel lost in the alphabet soup of LLMs, this post is your guide.
Why do Transformers Matter?
Before Transformers, we had RNNs and LSTMs, models that processed words one after another, like reading a novel one word at a time. Transformers said, “What if I could read the whole sentence at once and still understand context?”
Imagine trying to understand a book by looking through a straw — that’s an RNN. Transformers give you the full page view.
What is a Transformer?
At its core, a Transformer is a model that pays attention to different parts of input data, like a person scanning a paragraph and focusing on the most meaningful words.
Key idea: Instead of processing data sequentially, a transformer looks at the entire input at once and uses something called **self-attention** to figure out which parts are most relevant.
Breaking Down the Transformer
1. Input Embedding
We can’t feed raw words into a model; we convert them to vectors. Think of this as turning “I am studying transformers” into numerical Lego blocks the model can understand.
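Here’s a minimal sketch of that step using Keras’ Embedding layer. The vocabulary size, embedding dimension, and token IDs below are made-up placeholders, not values from any particular model:

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 10000   # placeholder vocabulary size
embed_dim = 64       # placeholder embedding dimension

# Maps each token ID to a dense vector of size embed_dim
embedding = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)

token_ids = tf.constant([[12, 45, 7, 301]])  # "I am studying transformers" as made-up IDs
vectors = embedding(token_ids)
print(vectors.shape)  # (1, 4, 64) — one 64-dim "Lego block" per word
```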
2. Positional Encoding
Unlike RNNs, Transformers don’t process words in order, so we encode each word’s position using sinusoidal math magic. It’s like tagging each Lego block with where it fits.
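If you’re curious what that “math magic” looks like, here’s a rough sketch of the sinusoidal encoding from the original Transformer paper. The sequence length and embedding size are just example values:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, embed_dim):
    # Each position gets a unique pattern of sines and cosines at different frequencies
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)
    dims = np.arange(embed_dim)[np.newaxis, :]      # (1, embed_dim)
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / np.float32(embed_dim))
    angles = positions * angle_rates                # (seq_len, embed_dim)
    angles[:, 0::2] = np.sin(angles[:, 0::2])       # even indices: sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])       # odd indices: cosine
    return angles

pe = sinusoidal_positional_encoding(seq_len=10, embed_dim=64)
print(pe.shape)  # (10, 64) — added to the word embeddings before the first layer
```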
3. Self-Attention
This is the heart of the transformer.
Let’s say you’re reading the sentence: “The bat flew over the field.”
Do we mean a flying mammal or a baseball bat? The word “flew” gives us a clue. Self-attention allows the model to focus on relevant words like “flew” to understand the meaning of “bat.”
Mathematically, self-attention computes weights (called attention scores) showing how much one word should pay attention to every other word in the sentence.
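Here’s a bare-bones sketch of that computation, scaled dot-product attention. For simplicity, the queries, keys, and values are all the same random tensor and no learned projections are applied:

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    # Attention scores: how much each word should attend to every other word
    scores = tf.matmul(q, k, transpose_b=True)                      # (batch, seq, seq)
    scores /= tf.math.sqrt(tf.cast(tf.shape(k)[-1], tf.float32))    # scale by sqrt(d_k)
    weights = tf.nn.softmax(scores, axis=-1)                        # each row sums to 1
    return tf.matmul(weights, v), weights                           # weighted sum of values

# Toy example: 1 sentence, 6 "words", 64-dim vectors
x = tf.random.uniform((1, 6, 64))
output, attn_weights = scaled_dot_product_attention(x, x, x)
print(output.shape, attn_weights.shape)  # (1, 6, 64) (1, 6, 6)
```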
4. Multi-Head Attention
Instead of one attention view, we create multiple “heads” that look at the data from different perspectives, like having multiple readers highlight different key parts of a paragraph.
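Keras ships a MultiHeadAttention layer that handles this for you. A quick sketch, with the head count and dimensions chosen arbitrarily for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers

# 4 heads, each with its own learned projections of queries, keys, and values
mha = layers.MultiHeadAttention(num_heads=4, key_dim=16)

x = tf.random.uniform((2, 10, 64))  # (batch, seq_len, embed_dim)
out, weights = mha(x, x, return_attention_scores=True)
print(out.shape)      # (2, 10, 64) — same shape as the input
print(weights.shape)  # (2, 4, 10, 10) — one attention map per head
```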
5. Feedforward Neural Network
After attention, we pass the output through a small feedforward network (two dense layers applied to each position) to add some transformation power.
6. Residual Connections + Layer Normalization
These help stabilize training and preserve information by adding shortcuts (like teleporters!) and keeping values normalized.
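Both sub-layers get wrapped the same way: add the input back in, then normalize. A tiny sketch of that pattern, using a Dense layer as a stand-in for whichever sub-layer (attention or feedforward) came before:

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.uniform((2, 10, 64))      # pretend this is the block's input
sublayer_output = layers.Dense(64)(x)   # stand-in for attention or the FFN

# Residual connection (the "teleporter") plus layer normalization
out = layers.LayerNormalization(epsilon=1e-6)(x + sublayer_output)
print(out.shape)  # (2, 10, 64) — same shape in, same shape out
```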
Code Time! Tiny Transformer Block in TensorFlow
Here’s a simplified version of a Transformer block using TensorFlow.
```python
import tensorflow as tf
from tensorflow.keras import layers


class TransformerBlock(tf.keras.layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = tf.keras.Sequential([
            layers.Dense(ff_dim, activation="relu"),  # First dense layer
            layers.Dense(embed_dim),                  # Project back to embedding dim
        ])
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=False):
        # Self-attention
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        # Feed-forward network
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)


# Example usage
embed_dim = 64
num_heads = 4
ff_dim = 128  # Feed-forward network dimension
sequence_length = 10
batch_size = 2

# Create dummy input: shape (batch_size, sequence_length, embed_dim)
sample_input = tf.random.uniform((batch_size, sequence_length, embed_dim))

# Initialize the Transformer block
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)

# Forward pass
output = transformer_block(sample_input, training=False)
print(output.shape)  # Output shape: (2, 10, 64)
```
What’s Happening
MultiHeadAttention handles self-attention
Two LayerNormalization layers stabilize training
A feedforward network transforms the features after attention
We apply residual connections:
output = input + transformed_output
Dropout adds regularization during training
Takeaway: Why You Should Care
Transformers revolutionized AI by replacing recurrence with attention, enabling models to learn context better, parallelize training, and scale to billions of parameters.
Whether you’re building chatbots, summarizing articles, or writing code with AI, understanding transformers is your first step into the world of generative intelligence.
TL;DR
Transformers process data in parallel and use attention to understand context
The self-attention mechanism allows each word to focus on others that matter
Their modular design makes them powerful, flexible, and scalable
You can start experimenting with transformer blocks using PyTorch or TensorFlow