09 - Convolutional Neural Networks (CNN)

Class: CSCE-421


Notes:

Convolutions

Fully Connected Layer

Pasted image 20260219105125.png500

Convolution Layer

Pasted image 20260219100750.png500

Pasted image 20260219100813.png500

Notes:

Pasted image 20260219101103.png500

Notes:

Pasted image 20260219101206.png500

Notes:

Pasted image 20260219101346.png500

Notes:

Pasted image 20260219101455.png500

Pasted image 20260219101509.png500

$$f[x,y] * g[x,y] = \sum_{n_1=-\infty}^{\infty} \sum_{n_2=-\infty}^{\infty} f[n_1, n_2]\, g[x - n_1,\, y - n_2]$$
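As a sanity check, the sum above can be implemented directly in numpy (a naive "full" convolution; `conv2d_full` is my name for it, not from the lecture):

```python
import numpy as np

def conv2d_full(f, g):
    """Naive 'full' 2D convolution implementing
    (f * g)[x, y] = sum over n1, n2 of f[n1, n2] * g[x - n1, y - n2]."""
    H, W = f.shape
    kH, kW = g.shape
    out = np.zeros((H + kH - 1, W + kW - 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            s = 0.0
            for n1 in range(H):
                for n2 in range(W):
                    # g[x - n1, y - n2] only contributes when in bounds
                    if 0 <= x - n1 < kH and 0 <= y - n2 < kW:
                        s += f[n1, n2] * g[x - n1, y - n2]
            out[x, y] = s
    return out
```

Note the filter is effectively flipped relative to cross-correlation, which is what deep-learning "conv" layers usually compute in practice.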

Notes:

Pasted image 20260219101615.png500

Notes:

A closer look at spatial dimensions:

Pasted image 20260219102031.png450
Pasted image 20260219102110.png450
Pasted image 20260219102154.png450
Pasted image 20260219102207.png450
Pasted image 20260219102224.png450

Notes:

$$\text{output size} = \frac{\text{input size} - \text{filter size}}{\text{stride}} + 1$$

Pasted image 20260219102455.png450
Pasted image 20260219102509.png450
Pasted image 20260219102525.png450
Pasted image 20260219102544.png475
Pasted image 20260219102612.png500

Output size: (N − F)/stride + 1. E.g. N = 7, F = 3:
stride 1 → (7 − 3)/1 + 1 = 5
stride 2 → (7 − 3)/2 + 1 = 3
stride 3 → (7 − 3)/3 + 1 = 2.33 (doesn't fit evenly)
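The rule can be wrapped in a small helper (the name and the `pad` parameter are mine; padding is covered just below):

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size of a convolution: (N + 2P - F)/stride + 1.
    A non-integer result means the filter doesn't tile the input evenly."""
    return (n + 2 * pad - f) / stride + 1
```

For example, `conv_output_size(7, 3, 2)` gives 3.0, matching the stride-2 case above.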

Notes:


Stride

In practice: Common to zero pad the border

Pasted image 20260219102912.png500

Notes:

Pasted image 20260219103225.png500

Notes:

Pasted image 20260219103346.png500

Convolution: translation-equivariance

Pasted image 20260219103443.png400

Pasted image 20260219103600.png500

Convolution = local connection + weight-sharing

Pasted image 20260219103705.png500

Notes:

Convolution: linear transform

Pasted image 20260219104430.png500

Notes:

Receptive Field

Pasted image 20260219105052.png500

Notes:

Examples time:

Input volume: 32×32×3
10 5×5 filters with stride 1, pad 2

Pasted image 20260224100140.png200

Output volume size: ?

Number of parameters in this layer?
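A possible worked answer, assuming the standard conventions (each filter spans all input channels and carries one bias):

```python
# Worked answers for the example above.
N, C = 32, 3                 # input volume: 32 x 32 x 3
num_filters, F = 10, 5       # ten 5x5 filters
stride, pad = 1, 2

out = (N + 2 * pad - F) // stride + 1      # (32 + 4 - 5)/1 + 1 = 32
params = num_filters * (F * F * C + 1)     # each filter: 5*5*3 weights + 1 bias
# Output volume: 32 x 32 x 10; parameters: 760
```

So padding by 2 preserves the 32×32 spatial size, and the layer has 760 parameters.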

1x1 convolution layers make perfect sense

Pasted image 20260224101911.png425

Notes:

Pooling Layer

Pasted image 20260224102723.png300

MAX POOLING

Pasted image 20260224102749.png400

Notes:

Fully Connected Layer

How to perform BP in convolution, pooling layers?

Pasted image 20260224103244.png500

Notes:

Optional for CSCE-421

Backpropagation in Convolutional Neural Networks

Backpropagating through Convolutions

Backpropagation with an Inverted Filter (Single Channel)

Filter during convolution:

a b c
d e f
g h i

Filter during backpropagation (rotated 180°):

i h g
f e d
c b a
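This flip can be verified numerically in 1-D (variable names are mine): if the forward pass is a "valid" cross-correlation, as conv layers actually compute, then the input gradient is a "full" convolution of the upstream gradient with the filter, and `np.convolve` applies exactly that 180° flip:

```python
import numpy as np

# Forward pass: 1-D "valid" cross-correlation.
x = np.array([1.0, 2.0, -1.0, 3.0, 0.5])
w = np.array([0.2, -0.5, 0.3])
y = np.correlate(x, w, mode='valid')       # length N - F + 1 = 3

# Suppose the upstream gradient dL/dy is all ones.
dy = np.ones_like(y)

# Gradient w.r.t. the input: full convolution of dy with w.
# np.convolve flips its second argument -- the "inverted filter".
dx = np.convolve(dy, w, mode='full')       # length 5 = len(x)

# Check against the definition dL/dx[j] = sum_i dy[i] * w[j - i].
dx_ref = np.zeros_like(x)
for j in range(len(x)):
    for i in range(len(dy)):
        if 0 <= j - i < len(w):
            dx_ref[j] += dy[i] * w[j - i]
assert np.allclose(dx, dx_ref)
```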

Notes:

Mini-batch SGD

Loop:

  1. Sample a batch of data
  2. Forward prop it through the graph (network), get loss
  3. Backprop to calculate the gradients
  4. Update the parameters using the gradient
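The four steps above can be sketched on a toy linear-regression problem (all names and numbers are illustrative; the gradient here is closed-form rather than from autodiff):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))                       # toy dataset
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=256)
W = np.zeros(3)
lr, batch_size = 0.1, 32

for step in range(200):
    idx = rng.integers(0, len(X), batch_size)       # 1. sample a batch
    Xb, yb = X[idx], y[idx]
    pred = Xb @ W                                   # 2. forward prop, get loss
    loss = np.mean((pred - yb) ** 2)
    grad = 2 * Xb.T @ (pred - yb) / batch_size      # 3. compute the gradient
    W -= lr * grad                                  # 4. update the parameters
```

After a few hundred steps `W` approaches the true weights despite the label noise.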

Activation Functions

tanh(x)

ReLU (Rectified Linear Unit)

Notes:

TLDR: in practice:

Data Preprocessing

Learning Rate in Gradient Descent

$$W := W - \eta \frac{\partial \varepsilon}{\partial W}$$

η: learning rate

Pasted image 20260226093723.png350

Notes:

Normalization

Pasted image 20260226093809.png350

Pasted image 20260226093843.png200

Notes:

Data normalization in machine learning

Pasted image 20260226093919.png600

Notes:

TLDR: In practice for images: center only

Not common to normalize variance, to do PCA or whitening

Normalization Modules

Pasted image 20260226094159.png500
Pasted image 20260226094247.png500
Pasted image 20260226095826.png500

Notes:

Batch Normalization

ONE OF THE MOST IMPORTANT TOPICS IN DEEP LEARNING

Batch Normalization

"you want zero-mean unit-variance activations? just make them so."
consider a batch of activations at some layer. To make each dimension zero-mean unit-variance, apply:

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$

this is a vanilla differentiable function...

Notes:

"you want zero-mean unit-variance activations? just make them so."

  1. compute the empirical mean and variance independently for each dimension.

    Pasted image 20260226100417.png233

    • Fully connected layer with D units
    • For each sample you will get an ... dimensional vector
    • Get the mean, get the standard deviation, subtract mean and divide standard deviation element-wise
  2. Normalize

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$

Notes:

Pasted image 20260226100614.png500

Notes:

Batch Normalization (Math)

Normalize:

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$

And then allow the network to squash the range if it wants to:

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$

Note, the network can learn:

$$\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}, \qquad \beta^{(k)} = \mathrm{E}[x^{(k)}]$$

to recover the identity mapping.

Notes:

Batch Normalization (Algorithm)

Input: values of x over a mini-batch: B = {x₁…ₘ}; parameters to be learned: γ, β
Output: {yᵢ = BN_γ,β(xᵢ)}

$$\mu_B \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i \quad \text{// mini-batch mean}$$
$$\sigma_B^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 \quad \text{// mini-batch variance}$$
$$\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \quad \text{// normalize}$$
$$y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i) \quad \text{// scale and shift}$$
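A minimal numpy sketch of the training-time algorithm (the function name is mine):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch norm over a mini-batch x of shape (m, D)."""
    mu = x.mean(axis=0)                      # mini-batch mean, per dimension
    var = x.var(axis=0)                      # mini-batch variance, per dimension
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # scale and shift
```

With γ = 1 and β = 0, every column of the output has (near) zero mean and unit variance.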

Note: at test time BatchNorm layer functions differently:

Notes:

...

Batch Normalization: Test Time

Input: x: N×D

Learnable params: γ, β: D

Running statistics from training:

μ_j = (running) average of values seen during training
σ_j² = (running) average of values seen during training

Intermediates: μ, σ: D and x̂: N×D, with

$$\hat{x}_{i,j} = \frac{x_{i,j} - \mu_j}{\sqrt{\sigma_j^2 + \varepsilon}}$$

Output: y: N×D

$$y_{i,j} = \gamma_j \hat{x}_{i,j} + \beta_j$$

Batch Normalization for ConvNets

Batch Normalization for fully-connected networks

x: N×D
Normalize with μ, σ: 1×D
Learn γ, β: 1×D
y = γ(x − μ)/σ + β

Batch Normalization for convolutional networks (Spatial Batchnorm, BatchNorm2D)

x: N×C×H×W
Normalize with μ, σ: 1×C×1×1
Learn γ, β: 1×C×1×1
y = γ(x − μ)/σ + β
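The only difference from the fully-connected case is which axes the statistics are taken over; a sketch (names are mine):

```python
import numpy as np

def batchnorm2d(x, gamma, beta, eps=1e-5):
    """Spatial batch norm: x has shape (N, C, H, W); statistics are
    computed over N, H, W, leaving one mean/variance per channel."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)     # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)     # shape (1, C, 1, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```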

Notes:

Layer Normalization

Batch Normalization for fully-connected networks

x: N×D
Normalize with μ, σ: 1×D
Learn γ, β: 1×D
y = γ(x − μ)/σ + β

Layer Normalization for fully-connected networks Same behavior at train and test! Can be used in recurrent networks

x: N×D
Normalize with μ, σ: N×1
Learn γ, β: 1×D
y = γ(x − μ)/σ + β
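Layer norm swaps the normalization axis: statistics are computed per sample instead of per feature, so no batch statistics are needed at test time. A sketch (names are mine):

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    """Layer norm: x has shape (N, D); each row is normalized by its own
    mean/variance (mu, sigma have shape N x 1)."""
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta            # gamma, beta: shape (D,)
```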

Notes:

Early Stopping

Pasted image 20260226103515.png500

Notes:

Regularization: Dropout

In each forward pass, randomly set some neurons to zero Probability of dropping is a hyperparameter; 0.5 is common

Pasted image 20260226103830.png500
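The forward pass described above, as a minimal sketch (`dropout_forward` and the default `p_drop` are my choices):

```python
import numpy as np

def dropout_forward(x, p_drop=0.5, rng=None):
    """Vanilla training-time dropout: zero each unit independently
    with probability p_drop (the hyperparameter from the notes)."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(x.shape) >= p_drop   # True = keep the unit
    return x * mask
```

With `p_drop = 0.5`, roughly half the activations are zeroed on each forward pass.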

Notes:

How can this possibly be a good idea?

Forces the network to have a redundant representation; Prevents co-adaptation of features
Pasted image 20260226104312.png500

Notes:

Another Interpretation

Pasted image 20260226104505.png150

Notes:

Dropout: Test time

Dropout makes our output random!

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image.png237

Want to "average out" the randomness at test-time

$$y = f(x) = E_z[f(x, z)] = \int p(z)\, f(x, z)\, dz$$

But this integral seems hard ...

Want to approximate the integral

$$y = f(x) = E_z[f(x, z)] = \int p(z)\, f(x, z)\, dz$$

Consider a single neuron.
At test time we have: $E[a] = w_1 x + w_2 y$
During training we have:

$$E[a] = \tfrac{1}{4}(w_1 x + w_2 y) + \tfrac{1}{4}(w_1 x + 0 \cdot y) + \tfrac{1}{4}(0 \cdot x + 0 \cdot y) + \tfrac{1}{4}(0 \cdot x + w_2 y) = \tfrac{1}{2}(w_1 x + w_2 y)$$

At test time, multiply the activations by the keep probability (1 − dropout probability)

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-1.png140
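The ½ factor can be checked by brute force over all four equally likely masks (the weights and inputs are illustrative):

```python
import numpy as np
from itertools import product

w1, w2, x, y = 0.7, -1.3, 2.0, 0.5

# Average the neuron's output over all four dropout masks (p = 0.5 each unit).
outs = [m1 * w1 * x + m2 * w2 * y for m1, m2 in product([0, 1], repeat=2)]
train_expectation = np.mean(outs)

# Test-time rule: keep all units, scale by the keep probability 1 - 0.5.
test_output = 0.5 * (w1 * x + w2 * y)
assert np.isclose(train_expectation, test_output)
```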

Notes:

Regularization: A common pattern

Training: Add some kind of randomness

y=fW(x,z)

Testing: Average out randomness (sometimes approximate)

$$y = f(x) = E_z[f(x, z)] = \int p(z)\, f(x, z)\, dz$$

Regularization: Data Augmentation

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-2.png500

Notes:

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-3.png500

Random crops and scales

Training: sample random crops / scales

ResNet:

  1. Pick random L in range [256, 480]
  2. Resize training image, short side =L
  3. Sample random 224×224 patch

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-4.png100
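Step 3 can be sketched in numpy (step 2's resize needs an image library, so it is omitted; names are mine):

```python
import numpy as np

def random_patch(img, size=224, rng=None):
    """Sample a random size x size crop from an image of shape (H, W, C)."""
    if rng is None:
        rng = np.random.default_rng()
    H, W = img.shape[:2]
    top = rng.integers(0, H - size + 1)
    left = rng.integers(0, W - size + 1)
    return img[top:top + size, left:left + size]
```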


Testing: average a fixed set of crops

ResNet:

  1. Resize image at 5 scales: {224,256,384,480,640}
  2. For each size, use 10 224×224 crops: 4 corners + center, plus flips

Regularization: A common pattern (summary)

Training: Add random noise
Testing: Marginalize over the noise

Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-5.png325x168

Transfer Learning

"You need a lot of data if you want to train/use CNNs"

Transfer Learning with CNNs

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-6.png

Notes:

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-7.png

CNN Architectures

Review: LeNet-5

This network will be similar to what will be on the final
00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-8.png518

Notes:

Case Study: AlexNet

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-9.png

Details/Retrospectives:

Notes:

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-10.png700

Notes:

Case Study: VGGNet

Small filters, Deeper networks

8 layers (AlexNet)
-> 16-19 layers (VGG16Net)

Only 3×3 CONV stride 1, pad 1 and 2×2 MAX POOL stride 2

11.7% top 5 error in ILSVRC'13 (ZFNet)
-> 7.3% top 5 error in ILSVRC'14

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-11.png375

Notes:


Q: Why use smaller filters? (3×3 conv)

Stack of three 3×3 conv (stride 1) layers has the same effective receptive field as one 7×7 conv layer

But deeper, more non-linearities

And fewer parameters: 3 × (3²C²) vs. 7²C² for C channels per layer
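Plugging in an illustrative channel count confirms the comparison (biases ignored, as on the slide):

```python
# Parameter counts for C input and C output channels per layer.
C = 64                              # illustrative channel count
stack_3x3 = 3 * (3 * 3 * C * C)     # three 3x3 conv layers: 27 C^2
single_7x7 = 7 * 7 * C * C          # one 7x7 conv layer:    49 C^2
assert stack_3x3 < single_7x7       # 27 C^2 < 49 C^2 for any C
```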

Notes:


Example:
00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-12.png

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-13.png

Case Study: GoogLeNet

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-14.png304

Apply parallel filter operations on the input from previous layer:

Concatenate all filter outputs together depth-wise

Q: What is the problem with this? [Hint: Computational complexity]

Notes:

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-15.png

Stack Inception modules with dimension reduction on top of each other

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-16.png400

Case Study: ResNet

(Look at slides for more detail)

What happens when we continue stacking deeper layers on a "plain" convolutional neural network?

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-17.png500

56-layer model performs worse on both training and test error
-> The deeper model performs worse, but it's not caused by overfitting!

Notes:


Hypothesis: this is an optimization problem; deeper models are harder to optimize

The deeper model should be able to perform at least as well as the shallower model.

A solution by construction is copying the learned layers from the shallower model and setting additional layers to identity mapping.


Solution: Use network layers to fit a residual mapping instead of directly trying to fit a desired underlying mapping

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-19.png417

Notes:


Full ResNet architecture:

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-20.png346

Notes:

"Bottleneck"

For deeper networks (ResNet-50+), use “bottleneck” layer to improve efficiency (similar to GoogLeNet)

image-25.png345x252

Notes:

Identity Mappings in Residual Learning

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-21.png224

Notes:

Identity Mappings in Deep Residual Networks

Improving ResNets...

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-22.png333

There is one more thing that is important:

Notes:

Things to takeaway:

  1. What is the skip operation (ResNet)
  2. Bottleneck
  3. Identity mapping

Question 1: What if the shortcut mapping h is not the identity?

(This is not in exam/not in homework)

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-23.png600

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-24.png593x328

What this paper proposes is to keep the skip connection as a pure identity mapping.

It turns out that others tried something similar but added more operations to every block (making it more complex), and it ended up not working.

Identity Mappings in Residual Learning

image-26.png453x199