My notes for a few chapters (Ch5-Ch7) of the book “Deep Learning with PyTorch”. Mostly to familiarize myself with the fundamentals.

## Ch5 The Mechanics of Learning

The chapter’s running example is temperature unit conversion.

The data tensors are:

```
t_c = torch.tensor([0.5, 14.0, 15.0, ..., 21.0])
t_u = torch.tensor([36.7, 55.9, 58.2, ..., 68.4])
```

We conjecture a linear relationship, t_c = w * t_u + b, and proceed to estimate w and b.
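
For reference, the model itself is just a plain Python function (the same definition is reused in the autograd section below):

```
def model(t_u, w, b):
    # linear model: predicted temperature from the unknown-unit readings
    return w * t_u + b
```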

To do so, first define our loss:

```
def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c)**2
    return squared_diffs.mean()
```

Note that this loss function returns a scalar by averaging over the elements. This is the MSE (mean squared error) loss.

Now initialize parameters:

```
w = torch.ones(())
b = torch.zeros(())
```

Note that w is currently displayed as “tensor(1.)”: it is a zero-dimensional (scalar) tensor, and broadcasting lets it combine with tensors of any shape. For example, if z is torch.zeros(5), then w + z gives tensor([1., 1., 1., 1., 1.]).

Now we **invoke the model**:

`t_p = model(t_u, w, b)`

Since t_u.shape is torch.Size([11]), here the output t_p is also of this shape.

Finally, we evaluate the loss given this initial w & b:

`loss = loss_fn(t_p, t_c)`

Now we explain the basic idea of gradient descent. We aim to do the following:

```
w = w - lr * loss_rate_of_change_wrt_w
b = b - lr * loss_rate_of_change_wrt_b
```

This “loss_rate_of_change_wrt_w” can be written via the chain rule:

`d loss_fn / dw = (d loss_fn / d t_p) * (d t_p / d w)`
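
Concretely, for the MSE loss and the linear model, the individual factors are (N is the number of data points):

```
d loss / d t_p = 2 * (t_p - t_c) / N   # derivative of the mean of the squared diffs
d t_p / d w    = t_u
d t_p / d b    = 1.0
```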

Putting these together, we define our gradient function:

```
def grad_fn(t_u, t_c, t_p, w, b):
    dloss_dtp = 2 * (t_p - t_c) / t_p.size(0)  # d loss / d t_p
    dloss_dw = dloss_dtp * t_u                 # chain rule: d t_p / d w = t_u
    dloss_db = dloss_dtp * 1.0                 # chain rule: d t_p / d b = 1
    return torch.stack([dloss_dw.sum(), dloss_db.sum()])
```

Mathematically, we are doing the following calculation:

`d loss / d w = sum_i [ 2 * (m_i - t_c_i) / N ] * t_u_i` (and similarly for b, with t_u_i replaced by 1),

where `m = w * t_u + b` stands for the model/prediction. **Note that m is a vector output**, so the per-element contributions are summed.

Then, we iterate to fit the model:

```
def training_loop(n_epochs, lr, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        w, b = params
        t_p = model(t_u, w, b)                # forward pass
        loss = loss_fn(t_p, t_c)
        grad = grad_fn(t_u, t_c, t_p, w, b)   # backward pass
        params = params - lr * grad
    return params
```

We often need to normalize inputs. First we do it in an ad hoc way:

`t_un = 0.1 * t_u`

and then run the training loop on the normalized input. In the end, given data “t_un”, we can call the trained model to give us predictions:

`t_p = model(t_un, *params) #note we use argument unpacking here`

Recall that *params means passing the elements of params as individual arguments.

Now we use “autograd” to compute gradients automatically.

```
def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c)**2
    return squared_diffs.mean()

params = torch.tensor([1.0, 1.0], requires_grad=True)
loss = loss_fn(model(t_u, *params), t_c)
loss.backward()
```

At this point, if you inspect “params.grad”, the output looks like “tensor([xxx, xxx])”: the gradient of the loss with respect to each element of params.

**Note that calling backward causes derivatives to accumulate at the leaf nodes rather than overwrite previous values. Thus, we need to “zero the gradient explicitly” after using it for parameter updates**.

Thus the correct code should look like:

```
def training_loop(n_epochs, lr, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        if params.grad is not None:
            params.grad.zero_()
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
        loss.backward()
        with torch.no_grad():
            params -= lr * params.grad
    return params
```

Note that we are **encapsulating the update in a no_grad context**. Within this block autograd looks away, so the in-place update is not recorded in the computational graph.

Also, we are updating params “in place”. This bookkeeping is automated once we use the optim package, immediately below.

```
def training_loop(n_epochs, optimizer, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
        optimizer.zero_grad()  # zeros the grad
        loss.backward()
        optimizer.step()       # updates params in place
    return params
```
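
To use this version, we build an optimizer over params; a minimal usage sketch (the learning rate, epoch count, and initial parameter values are just illustrative):

```
import torch.optim as optim

params = torch.tensor([1.0, 0.0], requires_grad=True)  # fresh parameters (initial values illustrative)
optimizer = optim.SGD([params], lr=1e-2)                # the optimizer holds a reference to params

params = training_loop(
    n_epochs=5000,
    optimizer=optimizer,
    params=params,
    t_u=t_un,   # normalized input from above
    t_c=t_c)
```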

Note that right now we assume a linear relationship, so params has only two entries (w and b). In Ch6 we will train on the same data and loss function, but remove this assumption and feed the algorithm a general neural network instead.

Now we split the dataset into training and validation sets to avoid overfitting.

```
n_samples = t_u.shape[0]
n_val = int(0.2 * n_samples) # 20% as validation
shuffled_indices = torch.randperm(n_samples)
train_indices = shuffled_indices[:-n_val]
val_indices = shuffled_indices[-n_val:]
train_t_u = t_u[train_indices]
train_t_c = t_c[train_indices]
val_t_u = t_u[val_indices]
val_t_c = t_c[val_indices]
train_t_un = 0.1 * train_t_u
val_t_un = 0.1 * val_t_u
def training_loop(...):
    # the only difference: we also evaluate the validation loss at every step
```

The important thing to remember is that we don’t need to track gradients (build the computational graph) when evaluating the model on the validation set; therefore, we can do something like:

```
...
with torch.no_grad():
    val_t_p = ...
    val_loss = ...
optimizer.zero_grad()
train_loss.backward()
optimizer.step()
```

Sometimes we use the “set_grad_enabled” context:

```
def calc_forward(t_u, t_c, is_train):
    with torch.set_grad_enabled(is_train):
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
    return loss
```
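
For example, the training and validation losses can then be computed with the flag toggled (a small usage sketch; the tensor names follow the split above):

```
train_loss = calc_forward(train_t_un, train_t_c, is_train=True)
val_loss = calc_forward(val_t_un, val_t_c, is_train=False)  # no graph is built here
```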

## Ch6 Using a Neural Network to Fit the Data

We solve the same problem via a full neural network.

Multilayer NN is made up of compositions of functions:

```
x_1 = f(w_0 * x + b_0)
x_2 = f(w_1 * x_1 + b_1)
...
y = f(w_n * x_n + b_n)
```

We need to replace our linear model with an NN unit. torch.nn is the submodule dedicated to neural networks: it contains the building blocks needed to create all sorts of NN architectures.

The building blocks are called “modules” in PyTorch parlance, or layers in other frameworks.

**A PyTorch module is a Python class deriving from the nn.Module base class**. A module can have Parameter instances as attributes, which are tensors whose values are optimized during training.

**A module can also have submodules (other nn.Module instances) as attributes, and it will track their parameters as well**. Submodules must be top-level attributes, or collected in containers such as “nn.ModuleList” or “nn.ModuleDict” (a plain Python list or dict is not tracked).
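
A minimal sketch of a custom module (TinyModel is a made-up example, not from the book): it holds a Parameter attribute and a Linear submodule, and both are picked up automatically by .parameters():

```
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(1, 4)             # submodule: its parameters are tracked
        self.scale = nn.Parameter(torch.ones(1))  # a raw Parameter attribute

    def forward(self, x):
        return self.scale * torch.tanh(self.hidden(x))

# list(TinyModel().parameters()) includes scale plus the Linear weight and bias
```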

All PyTorch-provided subclasses of nn.Module have their `__call__` method defined. This allows us to instantiate an nn.Linear and call it as if it were a function:

```
import torch.nn as nn
linear_model = nn.Linear(1,1)
linear_model(t_un_val) # recall t_un_val contains 2 data points
```

the output is a tensor of size 2 by 1.

**calling an instance of nn.Module with a set of arguments ends
up calling a method “forward” with the same arguments.**

The “forward” method is what executes the computation, but `__call__` does other things besides calling “forward” (e.g. running hooks). So call the module instance; do not use forward directly:

```
y = model(x) # good
y = model.forward(x) # bad
```

The constructor of “nn.Linear” accepts 3 arguments:

- the number of input features
- the number of output features
- whether to include a bias (default True)

```
linear_model = nn.Linear(1,1) # 1 input-dim, 1 output-dim
linear_model(t_un_val)
linear_model.weight # this gives w
linear_model.bias # this gives b
```

Often we need to batch the inputs and feed them to the model in one go. The input tensor then has size B × Nin, where B is the batch size and Nin is the number of input features:

```
x = torch.ones(10,1)
linear_model(x) # gives a tensor of shape (10,1)
```

So if we are given data as t_c = [1, 2, 3, 4, 5] and t_u = [2, 3, 4, 5, 6],

then we need to reshape the inputs to B × Nin, where B = 5 and Nin = 1:

```
t_c = torch.tensor(t_c).unsqueeze(1)
t_u = torch.tensor(t_u).unsqueeze(1)
```

The resulting t_c is a tensor of shape (5,1). It looks like a 5 by 1 column vector.

The whole implementation looks like this:

```
def training_loop(n_epochs, optimizer, model, loss_fn, t_u_train, t_u_val, t_c_train, t_c_val):
    for epoch in range(1, n_epochs + 1):
        t_p_train = model(t_u_train)
        loss_train = loss_fn(t_p_train, t_c_train)

        t_p_val = model(t_u_val)
        loss_val = loss_fn(t_p_val, t_c_val)

        optimizer.zero_grad()
        loss_train.backward()
        optimizer.step()
```

Note that we could modify the code to use nn.MSELoss() as argument for loss_fn, instead of our handwritten code.
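
For instance, the pieces could be wired together like this (a sketch; the hyperparameters are illustrative, and train_t_un etc. are assumed to be the split tensors from Ch5, reshaped to B × 1 as above):

```
import torch.nn as nn
import torch.optim as optim

linear_model = nn.Linear(1, 1)
optimizer = optim.SGD(linear_model.parameters(), lr=1e-2)

training_loop(
    n_epochs=3000,
    optimizer=optimizer,
    model=linear_model,
    loss_fn=nn.MSELoss(),   # built-in MSE, replacing our handwritten loss_fn
    t_u_train=train_t_un,
    t_u_val=val_t_un,
    t_c_train=train_t_c,
    t_c_val=val_t_c)
```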

Final step: replace this linear model with a multi-layered NN. The simplest way to chain modules is through the “nn.Sequential” container:

`seq_model = nn.Sequential(nn.Linear(1, 13), nn.Tanh(), nn.Linear(13,1))`

We can also name the submodules to make inspecting the parameters easier, as shown below.
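
For example, passing an OrderedDict to nn.Sequential gives each submodule (and hence its parameters) a readable name (the layer sizes are just illustrative):

```
from collections import OrderedDict

seq_model = nn.Sequential(OrderedDict([
    ('hidden_linear', nn.Linear(1, 13)),
    ('hidden_activation', nn.Tanh()),
    ('output_linear', nn.Linear(13, 1)),
]))

# parameters now show up with names like "hidden_linear.weight"
for name, param in seq_model.named_parameters():
    print(name, param.shape)
```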

## Ch7 Telling Birds from Airplanes: Learning from Images

First we download the CIFAR-10 data as a torchvision dataset. Some common operations on datasets are:

```
len(cifar10)
cifar10[4] # calling the __getitem__ method
```

For image data, we also need the “transforms.ToTensor()” transform to turn the images into tensors. The resulting “img_t” has shape torch.Size([3, 32, 32]) and dtype torch.float32. Other useful transforms include “transforms.Normalize”.
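
A minimal sketch of loading the dataset with the transform applied (the data_path is just a placeholder location):

```
from torchvision import datasets, transforms

data_path = './data/'   # placeholder download directory
cifar10 = datasets.CIFAR10(data_path, train=True, download=True,
                           transform=transforms.ToTensor())

img_t, label = cifar10[4]          # __getitem__ returns (image tensor, label)
print(img_t.shape, img_t.dtype)    # torch.Size([3, 32, 32]) torch.float32
```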