09 - Convolutional Neural Networks (CNN)
Class: CSCE-421
Notes:
Convolutions
Fully Connected Layer
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219105125.png)
Convolution Layer
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219100750.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219100813.png)
Notes:
- For example, if you have 3 input channels, you apply a different filter to each one and then add the results together
- This technique was used in the early days!
- More filters = more calculations -> your model is slower
- Most times we use full connections: each output slice is connected to every input channel; in this example we have 3 channels, but that is not required.
- 3 input slices -> 3 filters -> 1 output slice
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101103.png)
Notes:
- Remember that a different filter generates a different output
- All the output slices are generated completely independently of each other
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101206.png)
Notes:
- We want to use this convolution to get some features of the input image!
- Things to know (will be on the exam):
- If you know the size of the input, if you know the size of the filter, how can you calculate the size of the output?
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101346.png)
Notes:
- This operation of convolution is by far the most important one
- The network will have many layers, but the convolutional layers are by far the most important ones
- If you want 6 output slices:
- To generate one slice you need a 5x5x3 filter
- To generate 6 you need 6 such filters: (5x5x3)x6 weights
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101455.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101509.png)
Notes:
- You want to have filters that may relate to the input image so you can account for features of the image or some rotations
- Convolution = elementwise multiplication and sum of a filter and the signal (image)
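The "elementwise multiplication and sum" definition above can be sketched as a naive valid (no-padding) convolution in NumPy. This is an illustrative sketch, not the lecture's implementation; like most deep learning libraries it slides the filter without flipping it (technically cross-correlation).

```python
import numpy as np

def conv2d_valid(image, kernel):
    # slide the kernel over the image; at each position,
    # elementwise-multiply the window by the kernel and sum
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i+kh, j:j+kw] * kernel).sum()
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((3, 3))
print(conv2d_valid(img, k).shape)  # (2, 2)
```

Note the 4x4 input with a 3x3 filter gives a 2x2 output, matching the size formula derived below.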
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101615.png)
Notes:
- The image does not have to be square, but in most cases kernels are squares (rectangular filters are not commonly used)
- How do we calculate the output size for a 32x32 input image with a 5x5 filter?
- Output size = 28 = 32 - 5 + 1
- In general, the output size is determined by:
- size of input - size of filter + 1
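The formula above is a one-liner; a quick sketch with the lecture's 32x32 input and 5x5 filter:

```python
def conv_output_size(input_size, filter_size):
    # output size for stride 1, no padding ("valid" convolution)
    return input_size - filter_size + 1

print(conv_output_size(32, 5))  # 28
```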
A closer look at spatial dimensions:
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102031.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102110.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102154.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102207.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102224.png)
Notes:
- Roughly the output will be half the size of the input (ignoring rounding at the border)
- Remember so far we had:
- size = size of input - size of filter + 1
- Size = 7 - 3 + 1 = 5
- But with stride 2 that is not what we get
- So for a general stride you need:
- size = ((size of input - size of filter) / stride) + 1
- Here: (7 - 3)/2 + 1 = 3
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102455.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102509.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102525.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102544.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102612.png)
Notes:
- This is why the pixels on the borders are not treated fairly in comparison to the pixels in the center
- So what have we done about this?
Stride
- The stride is a parameter that determines how many pixels the filter (or kernel) moves across the input image at a time. Think of it as the "step size" of the sliding window as it scans the data.
- When a convolutional layer processes an image, it doesn't just look at the whole thing at once; it slides a small filter over the pixels.
- Stride of 1: The filter moves one pixel at a time. This captures the most detail but results in a larger output.
- Stride of 2: The filter jumps two pixels at a time. This skips every other position, effectively "jumping" across the image.
- Why adjust the stride? It allows you to control two main factors in your model:
- Output Dimensions: A larger stride reduces the spatial dimensions (height and width) of the output volume. This is often used as an alternative to pooling layers to downsample the data.
- Computational Efficiency: Because a larger stride results in fewer total "steps" for the filter to take, it reduces the number of calculations required, speeding up the processing time.
In practice: Common to zero pad the border
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102912.png)
Notes:
- Every time you apply a filter, your output size is reduced by some amount
- This will limit the number of convolutional layers that you can apply, because each layer will receive a smaller slice
- Now what if your input is increased by 1 pixel in each direction?
- This is equivalent to applying padding
- If you do the padding and then apply convolution, the size of the slice does not decrease
- This also helps because it somewhat mitigates the bothersome effect where border pixels were treated unfairly
- In conclusion, the size of your output is affected by:
- The size of the input
- The size of your filter
- The stride
- The padding
- Note that in most cases stride will be equal to 1 unless mentioned explicitly
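All four factors listed above fit in one formula, (input + 2*padding - filter)/stride + 1, sketched here with the lecture's numbers:

```python
def conv_output_size(n, f, stride=1, pad=0):
    # general formula: (input + 2*padding - filter) / stride + 1
    assert (n + 2 * pad - f) % stride == 0, "filter placements must fit evenly"
    return (n + 2 * pad - f) // stride + 1

print(conv_output_size(32, 5))           # 28: no padding shrinks the slice
print(conv_output_size(32, 5, pad=2))    # 32: padding preserves the size
print(conv_output_size(7, 3, stride=2))  # 3
```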
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103225.png)
Notes:
- With padding on the input, applying the convolution does not change the output size
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103346.png)
Convolution: translation-equivariance
- Process each window in the same way
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103443.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103600.png)
- f(x) means a convolution
- T(x) means a translation (move the image)
- Again remember this is only true for translation
- The network so far is:
- Translation equivariant
- But not rotation equivariant
- Note that in general filters are not rotated, they are fixed
Convolution = local connection + weight-sharing
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103705.png)
Notes:
- Look at the right model, this is the fully connected layer we have talked about in 06 - Multi-Layer Perceptron-Networks
- How are convolutional layers related to this?
- Convolution is a special case of a fully connected layer
- If your kernel size is 3, the output will only be connected to 3 inputs, you can see this is a locally connected layer (like in the middle model)
- The second thing about convolution is that we have shared parameters: each line of the same color has the same value.
- Eventually these connection weights are trained from data, so how can we guarantee that the matrix keeps the wiring required for convolution?
- In practice you do not need to worry about this, but in an implementation you would initialize the shared parameters to the same value, and in each update compute the gradient of each and average them.
- Not required to understand backward propagation in a convolutional layer in this class.
Convolution: linear transform
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219104430.png)
Notes:
- The left diagram is a one-dimensional convolution representation
- Think about it as an x vector (the input vector)
- Your output is a 4-dimensional y vector
- Values on each diagonal have to remain exactly the same
- Note that each output is connected to 3 inputs in this case
- That is equivalent to taking this w vector and computing the product with the x vector
- You still have a fully connected layer with a W matrix, but what is special about it is that some of the entries are fixed to 0 and some of them share the same value.
- Conceptually this is useful, but in practice this is not really what we implement; it is not the most efficient way of doing it.
- The idea here is inductive bias
- If you allow W to be a general matrix, the network is not smart enough to discover this structure by itself
- So instead of asking the network to learn it, we build the convolutional structure into the model
Receptive Field
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219105052.png)
Notes:
- Each unit will look at 3 units in the previous layer
- The network will be stacked with many layers
- each unit on a top layer will look at 3 units on the bottom layer
- So each unit on top will be able to look at a larger and larger areas of the input (covering a larger span)
- The idea is that:
- If you have an image and you do some convolution and have some outputs
- A unit is looking at some areas of the input
- Our goal is to convert this area into a fixed-length vector
- We want to end up with a 1x1 feature map
- You could have 100 channels at that 1x1 position, i.e. a 1x1x100 feature map
- You are basically capturing 100 different features of the input
- This unit captures some kind of feature of the input image
- Translation Invariance:
- If you have an object and move it, it will also move after a convolution is applied (the convolution itself is equivariant, not invariant)
- If we keep reducing the size of the feature map with each layer, the final layer will be a very small feature map
- This final layer is something called global pooling, we will compute the max over this final feature map
- Now the location does not matter anymore
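The growing receptive field described above can be computed directly: with stride-1 layers, each extra layer of a width-3 filter grows the span by 2. A minimal sketch:

```python
def receptive_field(num_layers, kernel=3):
    # with stride-1 layers, each layer adds (kernel - 1) to the span
    # of the input that one top unit can see
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1
    return rf

print([receptive_field(k) for k in (1, 2, 3)])  # [3, 5, 7]
```

So stacking two 3-wide layers covers the same span as a single 5-wide filter.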
Examples time:
Input volume:
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224100140.png)
- Note pad 2 means 2 pixels on each side, so the size grows by 4 in each dimension
- Remember each filter is fully connected across the input channels (always keep this in mind)
Output volume size: ?
- 32x32x10
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224100404.png)
- There will be exam questions about this
Number of parameters in this layer?
- Look at #Convolution linear transform
- To generate one output slice you need a filter of size 5x5x3
- This has nothing to do with stride, pad, size of input, or size of output
- It is only related to the size of the filter and the number of input feature maps
- To generate one output feature map you need 5x5x3 + 1 parameters (+1 for the bias)
- In this case you want to generate 10 slices, so the number of parameters will be (5x5x3 + 1)x10 = 760
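The parameter count above generalizes to any conv layer; a sketch using the example's numbers:

```python
def conv_layer_params(fh, fw, in_maps, out_maps):
    # one (fh x fw x in_maps) filter plus one bias per output slice
    return (fh * fw * in_maps + 1) * out_maps

print(conv_layer_params(5, 5, 3, 10))  # 760
```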
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224101129.png)
- So what is an important factor for determining the number of parameters?
- (5x5) size of filter -> larger filter = more parameters
- (x3) number of input feature maps -> every output is connected to every input feature map
- (x10) number of output slices -> more output slices = more parameters, because each slice is completely independent
- To reduce the number of parameters you can reduce any of the above
- What is the smallest filter you can use?
- A 1x1 -> basically a single number
- But this will not capture an interesting spatial pattern
- Although it is used in other applications
- In reality, if you want to extract an interesting pattern you need at least a 3x3 filter.
1x1 convolution layers make perfect sense
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224101911.png)
- "it is just used to reduce the number of feature maps -> and therefore reduce the number of parameters"
Notes:
- How are these filters useful?
- If you apply a 1x1 filter, the output size will not change even without padding
- Because you have 64 input slices, each of these 1x1 filters actually has depth 64 (1x1x64)
- You just need to worry about generating one slice (each one is generated independently in the same manner):
- You multiply each of the slices by the filter number
- Each output slice will be a linear combination of the input slices
- This is useful because:
- Say we have 64 input slices and want to apply 3x3 filters to get 32 output slices
- Instead, first use 1x1 filters to reduce the 64 slices to fewer feature maps (roughly 1x1x64 parameters per output slice)
- Then apply the 3x3 filters on top of this reduced stack
- This way the total number of parameters is smaller compared to applying the 3x3 filters directly on all 64 slices
- 1x1 convolution is just used to change the number of feature maps
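The savings can be checked with the parameter formula. A sketch with illustrative channel counts (64 input maps, 32 output maps, biases ignored for simplicity):

```python
def conv_weights(k, cin, cout):
    # weight count of a k x k convolution, ignoring biases
    return k * k * cin * cout

direct = conv_weights(3, 64, 32)                               # 3x3 straight on 64 maps
bottleneck = conv_weights(1, 64, 32) + conv_weights(3, 32, 32)  # 1x1 reduce, then 3x3
print(direct, bottleneck)  # 18432 11264
```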
Pooling Layer
- makes the representations smaller and more manageable
- operates over each activation map independently:
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224102723.png)
MAX POOLING
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224102749.png)
Notes:
- If we use a stride of 2, the size of the output is reduced by a factor of 2
- So stride is used to reduce the output slice size
- This is useful because:
- There are 0 parameters in a pooling layer
- What if an object is slightly distorted? If you do max pooling, that local distortion is actually ignored
- Pooling is applied slice by slice: each slice is processed completely independently
- The output dimension is reduced by a factor of 2, but the number of slices does not change
- How to do backward propagation on convolutional networks is not required in this class
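Max pooling with a 2x2 window and stride 2 can be sketched on a single slice; note there are no parameters, just a max over each window:

```python
import numpy as np

def max_pool_2x2(x):
    # 2x2 max pooling with stride 2 on one slice; no parameters involved
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool_2x2(a))
# [[6 8]
#  [3 4]]
```

In a real network this runs on each slice of the volume independently, as the notes say.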
Fully Connected Layer
- Contains neurons that connect to the entire input volume, as in ordinary Neural Networks
How to perform BP in convolution, pooling layers?
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224103244.png)
Notes:
- ReLU is just a simple nonlinear function, you do not need to worry about it
- ReLU(x) = max(0, x)
- This is useful because we can backward propagate through the max function
Optional for CSCE-421
Backpropagation in Convolutional Neural Networks
- Three operations of convolutions, max-pooling, and ReLU.
- The ReLU backpropagation is the same as any other network.
- Passes gradient to a previous layer only if the original input value was positive.
- The max-pooling passes the gradient flow through the largest cell in the input volume.
- Main complexity is in backpropagation through convolutions.
Backpropagating through Convolutions
- Traditional backpropagation is transposed matrix multiplication.
- Backpropagation through convolutions is transposed convolution (i.e., with an inverted filter).
- Derivative of loss with respect to each cell is backpropagated.
- Elementwise approach of computing which cell in input contributes to which cell in output.
- Multiplying with an inverted filter.
- Convert layer-wise derivative to weight-wise derivative and add over shared weights.
Backpropagation with an Inverted Filter (Single Channel)
| a | b | c |
|---|---|---|
| d | e | f |
| g | h | i |
| Filter during convolution |
| i | h | g |
|---|---|---|
| f | e | d |
| c | b | a |
| Filter during backpropagation |
- Multichannel case: We have 20 filters for 3 input channels (RGB) ⇒ We have 60 spatial slices.
- Each of these 60 spatial slices will be inverted and grouped into 3 sets of filters with depth 20 (one each for RGB).
- Backpropagate with the newly grouped filters.
Notes:
- Remember convolutions are just a special case of fully connected layers
- Can you convert back to a convolution?
- Yes, backward propagation is still another convolution
- But we need to use different filters (inverted in some way)
- Why is the filter inverted intuitively?
- In backward propagation you are looking at the network in a backward manner
- In forward propagation your filter moves from left to right, from top to bottom
- The first filter entry to touch a given unit is a (a comes first)
- So going backward, you are basically looking at the last number in the filter first
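The "inverted filter" in the tables above is just the forward filter flipped along both axes, which is a one-liner:

```python
import numpy as np

# the filter used during backprop is the forward filter
# reversed in both spatial dimensions
f = np.array([["a", "b", "c"],
              ["d", "e", "f"],
              ["g", "h", "i"]])
print(f[::-1, ::-1])
# [['i' 'h' 'g']
#  ['f' 'e' 'd']
#  ['c' 'b' 'a']]
```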
Mini-batch SGD
Loop:
- Sample a batch of data
- Forward prop it through the graph (network), get loss
- Backprop to calculate the gradients
- Update the parameters using the gradient
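The four-step loop above can be sketched end to end on a toy problem. This is an illustrative NumPy example (made-up data, linear model instead of a network) just to show the sample/forward/backprop/update cycle:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 3))        # toy dataset
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                           # noiseless labels for the sketch

w = np.zeros(3)
lr, batch_size = 0.1, 32
for step in range(200):
    idx = rng.integers(0, len(X), batch_size)   # 1. sample a batch of data
    xb, yb = X[idx], y[idx]
    pred = xb @ w                               # 2. forward prop, loss = mean (pred - y)^2
    grad = 2 * xb.T @ (pred - yb) / batch_size  # 3. backprop: gradient of the loss
    w -= lr * grad                              # 4. update parameters using the gradient
print(w.round(2))  # close to [ 1.  -2.   0.5]
```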
Activation Functions
tanh(x)
- Squashes numbers to range [-1, 1]
- zero centered (nice)
- still kills gradients when saturated :(
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224104653.png)
ReLU (Rectified Linear Unit)
- Computes f(x) = max(0, x)
- Does not saturate (in + region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- Actually more biologically plausible than sigmoid
- Not zero-centered output
- An annoyance:
- hint: what is the gradient when x < 0?
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224104858.png)
Notes:
- Leaky ReLU is more complicated, basically you will have also a small negative slope as well
TLDR: in practice:
- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU
- Try out tanh but don't expect much
- Don't use sigmoid
Data Preprocessing
Learning Rate in Gradient Descent
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226093723.png)
Notes:
- With a smaller step size you will take more steps
- With a larger step size you will take fewer steps
Normalization
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226093809.png)
- Same learning rate applied to all weights
- Large weights dominate updates
- Small weights oscillate (or diverge)
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226093843.png)
- Similar pace for all weights
Notes:
- You might need to use different learning rates for different weights
- We need to do some data normalization to make sure our data is roughly on the same scale
Data normalization in machine learning
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226093919.png)
Notes:
- How do we normalize training data?
- Each row in the above example is one sample x
- You compute the mean and the standard deviation for each of the features
- You want the mean of each feature to be 0 and the standard deviation to be small
- In summary: subtract the mean and divide (element-wise) by the standard deviation
- How do we normalize test data?
- You still need a mean and standard deviation, right? Wouldn't it be the same as for the training data?
- Yes, you want your data to be shifted and scaled by the same numbers for both training and test
- Problem if you instead compute statistics from the test batch:
- Predictions for each sample can depend on other samples
- This means your prediction will change depending on which other samples are in the same batch
- You can get different predictions from the same model
- This is a very common mistake, and it is a key idea of how we do data normalization
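The train/test rule above in a minimal sketch (made-up numbers): statistics come from the training set only, and the test set reuses them.

```python
import numpy as np

# toy training set: rows are samples, columns are features on very different scales
X_train = np.array([[1.0, 200.0],
                    [2.0, 400.0],
                    [3.0, 600.0]])
mu = X_train.mean(axis=0)      # per-feature mean
sigma = X_train.std(axis=0)    # per-feature standard deviation
X_train_norm = (X_train - mu) / sigma

# test data MUST reuse the training statistics, never its own batch statistics,
# otherwise a prediction depends on which other samples happen to be in the batch
X_test = np.array([[2.0, 500.0]])
X_test_norm = (X_test - mu) / sigma
print(X_train_norm.mean(axis=0))  # ~[0. 0.]
```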
TLDR: In practice for images: center only
- e.g. consider the CIFAR-10 example with [32,32,3] images
- Subtract the mean image (e.g. AlexNet)
- (mean image = [32,32,3] array)
- Subtract per-channel mean (e.g. VGGNet)
- (mean along each channel = 3 numbers)
- Not common to normalize variance, or to do PCA or whitening
Normalization Modules
- We want to maintain variance for all layers
- normalize features in the network
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226094159.png)
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226094247.png)
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226095826.png)
Notes:
- You have a network of many layers and you need to insert some normalization layers between them to make sure you are normalizing your data before it is input to the next layer.
Batch Normalization
ONE OF THE MOST IMPORTANT TOPICS IN DEEP LEARNING
- Allows us to actually build networks of many layers
- There will be a question about batch normalization in the final exam
Batch Normalization
"you want zero-mean unit-variance activations? just make them so."
consider a batch of activations at some layer. To make each dimension zero-mean unit-variance, apply:
$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$
this is a vanilla differentiable function...
Notes:
- When we are training deep learning models, we do mini-batch SGD
- You sample a batch of data
- Use batch to train your network
- calculate gradients
- Update parameters
- You divide your data into mini-batches and do forward propagation, backward propagation, and the parameter update for one batch before moving to the next batch
- One pass over the whole set of batches is called an epoch
"you want zero-mean unit-variance activations? just make them so."
- compute the empirical mean and variance independently for each dimension.
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226100417.png)
- Fully connected layer with D units
- For each sample you will get a D-dimensional vector
- Get the mean, get the standard deviation, then subtract the mean and divide by the standard deviation element-wise
- Normalize
Notes:
- This is in training
- You normalize for each of the mini-batches
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226100614.png)
- BN = Batch Normalization
- FC = Fully Connected
Notes:
- Conceptually this normalization layer can be inserted anywhere, because each of the layers receives input, does some process and returns some output.
- You usually insert after fully connected or convolutional layers, and before nonlinearity
Batch Normalization (Math)
Normalize:
$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$
And then allow the network to squash the range if it wants to:
$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$
Note, the network can learn:
$$\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}, \quad \beta^{(k)} = \mathrm{E}[x^{(k)}]$$
to recover the identity mapping.
Notes:
- A paper proposed doing something slightly different
- The normalization operation is purely computed of your data
- But with this change these two parameters become learnable as well
- The key idea is that they try to introduce some learnable parameter in this layer
- They try to do some data dependent adaptation (learning) and adjust normalization in that way since mean and standard deviation are not perfect.
- We can then use this to essentially un-do the normalization
- Each of these parameters is a vector of the same dimension as the mean and standard deviation vectors
- Remember that all of this is what happens in training, not in test time
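The training-time computation above can be sketched in NumPy (illustrative shapes, eps is the usual small constant for numerical stability):

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    # x: (N, D) mini-batch; gamma, beta: (D,) learnable scale and shift
    mu = x.mean(axis=0)                  # per-dimension batch mean
    var = x.var(axis=0)                  # per-dimension batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta          # gamma/beta let the net undo this if useful

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4)) * 5.0 + 3.0   # far from zero-mean unit-variance
out = batchnorm_train(x, np.ones(4), np.zeros(4))
print(out.mean(axis=0).round(6))  # ~[0. 0. 0. 0.]
```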
Batch Normalization (Algorithm)
- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe
Note: at test time BatchNorm layer functions differently:
- The mean/std are not computed based on the batch. Instead, a single fixed empirical mean of activations during training is used.
- (e.g. can be estimated during training with running averages)
Notes:
- During training you keep the learned parameters and the running statistics for batch normalization, and at test time you use these
- Essentially you are using your training statistics at test time
- Basically you keep a running average of the mean and standard deviation during training and use that at test time
- "Maintain a global mean and global standard deviation in training"
- PyTorch already handles this for you
- Consequence of using mini-batches:
- Suppose you are training a network with many layers
- If you do not use batch normalization, in training you can feed one sample at a time; for a batch of 10 samples you simply don't update until you have processed all 10
- If we add batch normalization between layers, can we still do that?
- No, because for each sample we need all the other samples in the batch in order to normalize
- Basically you need to give all 10 samples to the network at once
- The consequence is that:
- Between the layers you need to keep the output of each layer in memory
- So each layer has some impact on memory
- If you use a very large mini-batch size, the activations for each layer become larger -> takes more memory
- So your mini-batch size is limited by your memory
- In theory a larger mini-batch size is better, but you are limited by your GPU memory
- Usually not more than 10 GB
- High-end GPUs use 18 GB
- So this is a very important point since we do not have huge memory sizes
- But larger mini-batches -> more data per gradient estimate -> more accurate gradients
- QUESTION ON EXAM ABOUT LIMITATION OF MINI-BATCH
...
Batch Normalization: Test Time
Input: x (shape N x D)
Learnable params: γ, β (shape D)
Intermediates: at test time, μ and σ² are fixed to the running averages kept during training
Output: y = γ (x − μ) / √(σ² + ε) + β
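The test-time behavior above, sketched with illustrative running statistics: note no batch statistics are computed, so each sample's output is deterministic.

```python
import numpy as np

def batchnorm_test(x, gamma, beta, running_mu, running_var, eps=1e-5):
    # test time: use the fixed running statistics from training, not batch
    # statistics, so predictions are independent of the other samples
    return gamma * (x - running_mu) / np.sqrt(running_var + eps) + beta

x = np.array([[1.0, 10.0]])
out = batchnorm_test(x, gamma=np.ones(2), beta=np.zeros(2),
                     running_mu=np.array([1.0, 10.0]),
                     running_var=np.array([4.0, 9.0]))
print(out)  # ~[[0. 0.]]
```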
Batch Normalization for ConvNets
Batch Normalization for fully-connected networks
Batch Normalization for convolutional networks (Spatial Batchnorm, BatchNorm2D)
Notes:
- That x vector is essentially why we need to understand mini-batch limitation, it gets bigger with your mini-batch size (N)
Layer Normalization
Batch Normalization for fully-connected networks
- The mean and standard deviation are D-dimensional
- Here your normalization averages across N (the batch dimension)
Layer Normalization for fully-connected networks Same behavior at train and test! Can be used in recurrent networks
- Your mean and standard deviation have the dimension of the batch size (N)
- Here your normalization averages across D (the feature dimension)
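The axis difference between the two is just which dimension you average over; a sketch:

```python
import numpy as np

x = np.arange(12, dtype=float).reshape(3, 4)  # (N=3 samples, D=4 features)
bn_mu = x.mean(axis=0)   # batch norm: average over the batch N -> shape (4,) = D
ln_mu = x.mean(axis=1)   # layer norm: average over the features D -> shape (3,) = N
print(bn_mu.shape, ln_mu.shape)  # (4,) (3,)
```

Because layer norm needs no other samples, it behaves the same at train and test time.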
Notes:
- Example: sentences
- If you have 10 sentences in the same mini-batch and each sentence can have a different length, how do you deal with that when you need to normalize?
- You can use a sentence-length threshold, like 30 words, and pad or truncate shorter or longer sentences
- Fun fact: graphics use layer normalization
Early Stopping
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226103515.png)
- Stop training the model when accuracy on the validation set decreases
- Or train for a long time, but always keep track of the model snapshot that worked best on val
Notes:
- If you train a logistic regression you do not want to stop early
- Once you move to networks, there is a hidden layer
- You cannot just follow the gradient to convergence, because it will only give you a local minimum
- You still need a validation data set to help you decide when to stop
- This is because training accuracy will keep increasing, but the model will eventually overfit
- You need to find the point where overfitting starts, and you can detect this point with a validation data set
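The "keep the best snapshot, stop when validation stops improving" idea can be sketched on a made-up validation-accuracy curve (all numbers and the patience value are illustrative):

```python
# toy validation-accuracy curve: rises, peaks, then degrades (overfitting)
val_accs = [0.60, 0.70, 0.75, 0.78, 0.77, 0.76, 0.74, 0.73]

best_acc, best_epoch, patience, bad = 0.0, -1, 3, 0
for epoch, acc in enumerate(val_accs):
    if acc > best_acc:
        best_acc, best_epoch, bad = acc, epoch, 0  # keep the best model snapshot
    else:
        bad += 1
        if bad >= patience:
            break  # stop: no improvement for `patience` epochs in a row
print(best_epoch, best_acc)  # 3 0.78
```

In a real run, "keep the snapshot" means saving the model weights at the best epoch.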
Regularization: Dropout
In each forward pass, randomly set some neurons to zero Probability of dropping is a hyperparameter; 0.5 is common
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226103830.png)
Notes:
- This is a very useful technique
- This only works on fully connected layers, if you have a network, there must be at least one fully connected layer and this will apply only to that layer
- Drop out:
- It does something different in training and test
- In training time we basically remove units, removing means removing input and output connection of a unit.
- During forward and backward propagation you remove nodes
- This is special for each sample
- If your dropout probability is 0.2, you have 0.2 probability of removing each of these nodes
How can this possibly be a good idea?
Forces the network to have a redundant representation; Prevents co-adaptation of features
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226104312.png)
Notes:
- Forces the network to create some kind of redundancy in its features, so that if some of them are removed you can still recognize the object
- The idea is that even with part of the representation removed, you still want to make accurate predictions
Another Interpretation
- Dropout is training a large ensemble of models (that share parameters).
- Building multiple different models around a single set of data
- Here comes the idea of a Random Forest
- Each binary mask is one model
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226104505.png)
- How many possible ways are there to drop out an entire layer if each unit has two possibilities (dropped out or not)?
- An FC layer with 4096 units has 2^4096 possible masks!
- This is a huge number! -> so many different models!
- Only ~10^82 atoms in the universe...
Notes:
- Each time you see a different dropout pattern, you are looking at a different network with the same training scheme
- You train multiple models with the same training set, training each of them costs a lot but this is a way to deal with that
- This is the most well known way to simulate ensemble in deep learning models
- Example:
- Random Forest:
- You need some kind of randomization that can come from data, you then build many different models
- In deep learning this is not used because it is not practical to train different models
- Random Forest:
- remember that training a single neural network is very expensive
- For a network like the above each node has some possibility (drop out, or not)
- Each mask counts as a different network; it is random and specific to each sample
- The mask may also differ from one mini-batch to the next
Dropout: Test time
Dropout makes our output random!
/CSCE-421/Ex2/Visual%20Aids/image.png)
Want to "average out" the randomness at test-time
But this integral seems hard ...
Want to approximate the integral
Consider a single neuron with inputs x, y and weights w1, w2.
At test time we have: E[a] = w1·x + w2·y
During training (with drop probability 0.5) we have:
E[a] = 1/4 (w1·x + w2·y) + 1/4 (w1·x) + 1/4 (w2·y) + 1/4 (0) = 1/2 (w1·x + w2·y)
At test time, multiply by the keep probability (1 − dropout probability)
- This scaling factor is determined by the dropout probability
/CSCE-421/Ex2/Visual%20Aids/image-1.png)
Notes:
- In training we have some randomness, but in test we have no randomness. Why?
- In prediction everything needs to be deterministic (no randomness)
- In training each sample uses a different mask; conceptually, every possible mask gives you a network, and you would average the predictions of all of them
- In prediction time you do not drop any nodes; you use the full network and multiply the activations by the keep probability
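The train/test asymmetry above in a minimal sketch (the 0.5 probability is the common default mentioned earlier): random masks in training, a deterministic scale by the keep probability at test time.

```python
import numpy as np

p_drop = 0.5  # dropout probability (hyperparameter)
rng = np.random.default_rng(0)

def dropout_train(x):
    mask = rng.random(x.shape) >= p_drop  # drop each unit independently
    return x * mask                       # a fresh random mask per call

def dropout_test(x):
    return x * (1.0 - p_drop)  # no dropping: scale by the keep probability

x = np.ones(100_000)
print(dropout_train(x).mean())  # ~0.5 on average (random)
print(dropout_test(x).mean())   # exactly 0.5 (deterministic)
```

The test-time scaling makes the expected activation match the training-time average, which is the "average out the randomness" approximation.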
Regularization: A common pattern
Training: Add some kind of randomness
- Example: Batch Normalization
- Training: Normalize using stats from random minibatches
Testing: Average out randomness (sometimes approximate)
- Testing: Use fixed stats to normalize
Regularization: Data Augmentation
/CSCE-421/Ex2/Visual%20Aids/image-2.png)
Notes:
- Another way to encourage invariance is to do data augmentation
- We want the network to be invariant to image translation and rotation
- Operations such as flip, rotate, or slightly translate the image
- If we randomly rotate an image by 90 degrees and tell the network this is a cat, then give it a 180-degree rotated image and keep telling the network this is a cat, and so on
- We are teaching the network by examples that rotated cats are also cats
- But the network might still make different predictions
- During training time we try to make our model invariant, but during prediction time this is no longer the case. Why?
/CSCE-421/Ex2/Visual%20Aids/image-3.png)
Random crops and scales
Training: sample random crops / scales
ResNet:
- Pick random L in range [256, 480]
- Resize training image, short side = L
- Sample random 224x224 patch
/CSCE-421/Ex2/Visual%20Aids/image-4.png)
Testing: average a fixed set of crops
ResNet:
- Resize image at 5 scales: {224, 256, 384, 480, 640}
- For each size, use 10 crops: 4 corners + center, + flips
Regularization: A common pattern (summary)
Training: Add random noise
Testing: Marginalize over the noise
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
/CSCE-421/Ex2/Visual%20Aids/image-5.png)
Transfer Learning
"You need a lot of a data if you want to train/use CNNs"
Transfer Learning with CNNs
/CSCE-421/Ex2/Visual%20Aids/image-6.png)
Notes:
- Lets say you only have 2000 images and you want to teach a network with this set
- The idea is that in many cases, if you train a network like the first one, the filters learned in the lower layers are very similar even across different data sets; since most objects have edges, these first filters basically act as edge detectors
- If you have a very small data set, you can take that pretrained network and fine-tune it on your images
- You keep many of the early layers fixed and mostly retrain some of the later layers
/CSCE-421/Ex2/Visual%20Aids/image-7.png)
CNN Architectures
Review: LeNet-5
This network will be similar to what will be on the final
/CSCE-421/Ex2/Visual%20Aids/image-8.png)
Notes:
- Network:
- A convolutional layer
- Pooling layer
- Convolutional layer
- Pooling layer
- Fully connected layer
- This is what you will need to implement in the next homework
- This is an idea from the 1990s; at that time there was no GPU training, so this was about the best you could do on a CPU
- It is such a simple network and therefore not very effective
- LUSH is an open-source software package you can use to implement these kinds of networks
- The L stands for LISP
- LISP stands for LISt Processing; it is a very different language
- Coding in it is very counterintuitive
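The shape bookkeeping for a conv/pool stack like this (and the output-size question flagged for the exam) can be checked in a few lines of Python. The 32x32 input and 5x5 conv / 2x2 pool sizes are the classic LeNet-5 numbers, assumed here for illustration:

```python
def conv_out(w, f, s=1, p=0):
    """Spatial output size of a conv or pool layer: (W - F + 2P) // S + 1."""
    return (w - f + 2 * p) // s + 1

# LeNet-5-style walkthrough (assumed: 32x32 input, no padding)
w = 32
w = conv_out(w, 5)        # conv 5x5           -> 28
w = conv_out(w, 2, s=2)   # pool 2x2, stride 2 -> 14
w = conv_out(w, 5)        # conv 5x5           -> 10
w = conv_out(w, 2, s=2)   # pool 2x2, stride 2 -> 5
print(w)  # 5; the 5x5 maps are then flattened into the fully connected layer
```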
Case Study: AlexNet
/CSCE-421/Ex2/Visual%20Aids/image-9.png)
Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%
Notes:
- The only thing that is new in this paper is the use of ReLU
- He used C++ and implemented convolutional networks on GPU for the first time
- The code is very messy
- This is now a much bigger network
- You have a convolutional layer
- Max pool layer
- Normalization (not batch normalization)
- and so on
- The key point of this paper is that they were able to write this code for the GPU, so they were able to train a larger model
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
/CSCE-421/Ex2/Visual%20Aids/image-10.png)
Notes:
- All of the earlier winners were not deep learning
- This was essentially the first demonstration that deep learning can outperform other models!
- Before this, no one thought deep learning would work
- This tells us that sometimes when you try to develop something (even with a great idea), it is the implementation that will challenge you
Case Study: VGGNet
Small filters, deeper networks
8 layers (AlexNet) -> 16-19 layers (VGG16Net)
Only 3x3 conv (stride 1, pad 1) and 2x2 max pool (stride 2)
11.7% top 5 error in ILSVRC'13 (ZFNet) -> 7.3% top 5 error in ILSVRC'14
/CSCE-421/Ex2/Visual%20Aids/image-11.png)
Notes:
- The only difference between the variants is how many layers they have
- In these networks they only use 3x3 filters, stacked directly on top of each other
- There is no pooling or other layer in between them
- They do this one after the other multiple times because:
- If you want a larger receptive field, you just stack one more convolutional layer
- What has been shown is that this is better in terms of the representable function
- They are nested: each layer is nested on top of the other
- In total, the set of functions that can be represented is larger
- You can use a smaller number of parameters to represent a much more complex function
- You do not need to understand this network in full; you only need the idea that this is better than applying one convolution layer with a larger filter
- The only thing you need to understand here is that in these networks we only use 3x3 convolutions directly stacked on top of each other
Q: Why use smaller filters? (3x3 conv)
A stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer
But deeper, more non-linearities
And fewer parameters: 3 * (3^2 * C^2) vs. 7^2 * C^2 for C channels per layer
Notes:
- Think about the number of parameters
- Let's say you have an input feature map
- Say it has c channels, and you want to generate c output feature maps
- For the 7x7 conv layer, how many parameters would there be? (let's not count the bias)
- This will be 7 * 7 * c * c
- That is 49c^2
- For a 3x3 kernel:
- 3 * 3 * c * c
- That is 9c^2
- If you apply this 3 times you get:
- 3 * 9 * c^2
- That is 27c^2
- That is why we still use this (this idea is still common)
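The 49c^2 vs. 27c^2 arithmetic above can be verified directly. `conv_params` is a hypothetical helper, and `c = 64` is an arbitrary choice; the ratio is the same for any c:

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k conv layer mapping c_in -> c_out maps (bias ignored)."""
    return k * k * c_in * c_out

c = 64                                  # assumed channel count
one_7x7 = conv_params(7, c, c)          # 49 * c^2 = 200704
three_3x3 = 3 * conv_params(3, c, c)    # 27 * c^2 = 110592
print(one_7x7, three_3x3)               # the stacked 3x3 layers win
```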
Example:
/CSCE-421/Ex2/Visual%20Aids/image-12.png)
- The blue part tells you the number of parameters of each layer
- What can you see from these numbers?
- You can see memory keeps reducing: as you go from layer to layer, the feature map size is reduced over the network
- Note that as the memory reduces, the number of parameters increases!
- What determines the number of parameters here is the number of feature maps
- When you reach the end it becomes a 4096-dimensional vector
- That is also the reason we use only one fully connected layer
/CSCE-421/Ex2/Visual%20Aids/image-13.png)
Case Study: GoogLeNet
/CSCE-421/Ex2/Visual%20Aids/image-14.png)
Apply parallel filter operations on the input from the previous layer:
- Multiple receptive field sizes for convolution (1x1, 3x3, 5x5)
- Pooling operation (3x3)
Concatenate all filter outputs together depth-wise
Q: What is the problem with this? [Hint: Computational complexity]
Notes:
- This network itself is no longer useful, but there is one important idea in it
- Normally what we would do is apply some convolution to generate the output
- What they wanted was to use filters of different sizes in each layer (this motivation is no longer used, but it was their initial one)
- You cannot easily pick a single filter size per layer (implementation-wise it is difficult to handle), so they apply all sizes in parallel
- Before this, within each layer the size of the feature map was always the same
- This introduces many parameters!
- Because you will generate a large number of feature maps
- The number of parameters becomes too large
/CSCE-421/Ex2/Visual%20Aids/image-15.png)
- What they tried to do is use a 1x1 conv as a "bottleneck"
- This is a still-useful technique!
- They want to reduce the number of feature maps
- They use this 1x1 convolution mainly to reduce the number of feature maps
- Then the number of parameters is reduced, because the layer between a 1x1 and a 3x3 convolution involves fewer feature maps
- Doesn't this also introduce some parameters?
- Yes, but the key thing is that the 1x1 filter is very small, and it reduces the number of feature maps going into the 3x3 convolution
- If you calculate the number of parameters, the total is smaller than if you applied the 3x3 or 5x5 convolution directly to the outputs of the previous layer
- First use 1x1 convolutions to reduce the number of feature maps, then apply the larger convolutions
- That is why this is called a bottleneck: you receive a large number of feature maps from the previous layer, use a 1x1 convolution to reduce that number, and then apply convolutions to expand the number of feature maps delivered to the next layer
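A quick sketch of why the 1x1 reduction saves parameters; the channel counts (256 input maps, 64 output maps) are assumed purely for illustration:

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k conv layer mapping c_in -> c_out maps (bias ignored)."""
    return k * k * c_in * c_out

# Direct 5x5 conv on all 256 input maps vs. reducing to 64 maps with a 1x1 first
direct = conv_params(5, 256, 64)                             # 409600
reduced = conv_params(1, 256, 64) + conv_params(5, 64, 64)   # 16384 + 102400
print(direct, reduced)  # the bottleneck path uses far fewer weights
```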
Stack Inception modules with dimension reduction on top of each other
/CSCE-421/Ex2/Visual%20Aids/image-16.png)
Case Study: ResNet
(Look at slides for more detail)
What happens when we continue stacking deeper layers on a "plain" convolutional neural network?
/CSCE-421/Ex2/Visual%20Aids/image-17.png)
56-layer model performs worse on both training and test error
-> The deeper model performs worse, but it's not caused by overfitting!
Notes:
- Note the 20-layer network is a subset of the 56-layer network
- In theory the 56-layer network has larger capacity than the 20-layer one, yet the 56-layer network has higher training error
- Same thing for the test error
Hypothesis: the problem is an optimization problem, deeper models are harder to optimize
The deeper model should be able to perform at least as well as the shallower model.
A solution by construction is copying the learned layers from the shallower model and setting additional layers to identity mapping.
Solution: Use network layers to fit a residual mapping instead of directly trying to fit a desired underlying mapping
/CSCE-421/Ex2/Visual%20Aids/image-19.png)
- Use layers to fit the residual F(x) = H(x) - x instead of H(x) directly
Notes:
- The idea is that sometimes doing nothing is actually better!
- You can see that in this network the only thing we did is copy the input feature maps to the output
- On one path the input feature map goes through the convolutions, and on the other path it is passed directly to the output, where the two are summed together
- What does this mean?
- Hard requirement:
- The two paths have to produce exactly the same number of feature maps, of the same size
- This whole block in general does not change the number or size of the feature maps
- Takeaway: the number and size of the feature maps at the input will be the same as the number and size of the feature maps at the output
- "Does not change the number of output feature maps"
- This is the straightforward case; there are more complex cases
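A minimal NumPy sketch of a residual block on a single-channel map (BN omitted, weights random; a toy illustration of the shape requirement, not the actual ResNet code):

```python
import numpy as np

def conv3x3(x, w):
    """'Same' 3x3 convolution on one single-channel map (pad=1, stride=1)."""
    h, ww = x.shape
    xp = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    for i in range(h):
        for j in range(ww):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * w)
    return out

def residual_block(x, w1, w2):
    """F(x) + x: both paths must yield the same shape for the sum to work."""
    f = np.maximum(conv3x3(x, w1), 0.0)  # conv + ReLU (BN omitted)
    f = conv3x3(f, w2)                   # conv
    return np.maximum(f + x, 0.0)        # add the skip path, then final ReLU

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
y = residual_block(x, rng.standard_normal((3, 3)), rng.standard_normal((3, 3)))
print(y.shape)  # same as x.shape: the block preserves number and size of maps
```

The "same" padding is what keeps the conv path the same size as the skip path, so the element-wise addition is valid.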
Full ResNet architecture:
- Stack residual blocks
- Every residual block has two 3x3 conv layers
- Periodically, double # of filters and downsample spatially using stride 2 (/2 in each dimension)
- Additional conv layer at the beginning
- No FC layers at the end (only FC 1000 to output classes)
/CSCE-421/Ex2/Visual%20Aids/image-20.png)
Notes:
- Note the 7x7 conv at the start is not important here
- Note that after every two 3x3 conv layers we have a residual connection
- Two layers of 3x3 convolution per block
- Note the notation: 3x3 conv, 512, /2
- The /2 means the spatial size of the feature maps is halved (stride 2)
- After the final global pooling layer, the size of the feature maps becomes 1x1
- After this pooling layer, you produce the vector x (a feature map of size 1x1) -> x will be the input for logistic regression
- FC 1000 is our only fully connected layer
- In each network we need at least one fully connected layer
- Then we do softmax
- You can see that for these layers (3x3 blocks), because we do padding, they do not change the size of the feature maps
- The only tricky part is between the layers of different colors
- Inside each color, the number and size of the feature maps stay the same
- We need to do something between the different colors to make sure the shapes match up for the skip connection
- The maroon blocks produce 64 feature maps
- The purple blocks output 128 feature maps: the size of each feature map is halved while the number of feature maps is doubled
- How do you increase the number of feature maps? Just add more filters
- Exam: some operation needs to be done on the skip path to achieve that /2. How does it happen that the number of feature maps is doubled while the size of the feature maps is halved?
- We use a stride-2 1x1 conv (+BN) on the shortcut and double the number of output feature maps
- Note that there needs to be a batch normalization after it
- Note that after each convolutional layer there is batch normalization, and ReLU after that
- The whole set of operations in one block is:
- Conv + BN + ReLU + conv + BN + addition + ReLU
- What does it mean to do a 1x1 convolution with stride 2?
- Kernel size is 1x1
- Stride is 2
- Basically you are skipping every other pixel
- The major difference is how many of these blocks you will use, there are networks of hundreds of layers:
- Total depths 34, 50, 101, or 152 layers for ImageNet
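What a stride-2 1x1 convolution does is easiest to see with plain array slicing; the array values and the single scalar weight below are arbitrary:

```python
import numpy as np

x = np.arange(36, dtype=float).reshape(6, 6)  # one input feature map
w = 0.5                                       # a 1x1 kernel is one weight per in/out channel pair

# Stride-2 1x1 conv on one channel: scale, and keep every other pixel
y = w * x[::2, ::2]
print(x.shape, y.shape)  # (6, 6) -> (3, 3): spatial size halved
```

Doubling the number of output feature maps then just means using twice as many such filters.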
"Bottleneck"
For deeper networks (ResNet-50+), use “bottleneck” layer to improve efficiency (similar to GoogLeNet)
/CSCE-421/Ex2/Visual%20Aids/image-25.png)
Notes:
- The number of feature maps is reduced to 64 using a 1x1 convolution; then you apply a 3x3 convolution while keeping the same number of feature maps
- The entire path does not change the number of feature maps, because you apply another 1x1 convolution after the 3x3 output to expand back
- By doing this whole block, you have a smaller number of parameters than by just applying a 3x3 convolution at full width
- This is called a bottleneck because we go from a larger -> smaller -> larger number of feature maps
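The parameter savings can be checked with the channel sizes from the slide's example (256 -> 64 -> 64 -> 256); `conv_params` is a hypothetical helper:

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k conv layer mapping c_in -> c_out maps (bias ignored)."""
    return k * k * c_in * c_out

bottleneck = (conv_params(1, 256, 64)     # 1x1 reduce:  16384
              + conv_params(3, 64, 64)    # 3x3 narrow:  36864
              + conv_params(1, 64, 256))  # 1x1 expand:  16384
plain = conv_params(3, 256, 256)          # one plain 3x3 at full width: 589824
print(bottleneck, plain)  # the three-layer bottleneck is much cheaper
```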
Identity Mappings in Residual Learning
/CSCE-421/Ex2/Visual%20Aids/image-21.png)
- Residual learning: the shortcut is an identity mapping
Notes:
- There is one more thing that is important
Identity Mappings in Deep Residual Networks
Improving ResNets...
- Improved ResNet block design from creators of ResNet
- Creates a more direct path for propagating information throughout network (moves activation to residual mapping pathway)
- Gives better performance
/CSCE-421/Ex2/Visual%20Aids/image-22.png)
There is one more thing that is important:
- Sequence of operations (one block):
- Conv + BN + ReLU + conv + BN + addition + ReLU
- If you only follow this skip connection, what ... is this block?
- You still end with a ReLU every time
Notes:
- If you look at the whole network, there are some layers stacked on top of each other
- You have many blocks on top of each other
- When you put in a skip connection, it needs to start somewhere and end somewhere
- The discussion is: where do we start this skip connection, and where do we end it?
- Where exactly do we consider a certain sequence of layers a single block, and bound our skip connection based on that block?
- In this example, if you follow the skip connections, you have a ReLU at the end of each block
- What if we shift the boundaries of the skip connection one layer upward (red line)?
- If you do that, the left image is what you get
- Now there is no ReLU between the blocks; the ReLU is inside
- There are no firm guidelines on why this is better than having ReLU at the end of a block; it is an ongoing discussion, but this version is more used in practice
- This is just our view; in reality the network is a bunch of blocks on top of each other
Things to takeaway:
- What is the skip operation (ResNet)
- Bottleneck
- Identity mapping
Question 1: What if the shortcut mapping is not the identity?
(This is not in exam/not in homework)
/CSCE-421/Ex2/Visual%20Aids/image-23.png)
/CSCE-421/Ex2/Visual%20Aids/image-24.png)
What this paper proposes is to keep that skip connection as the identity.
It turns out someone did something similar but added more operations to every block (making it more complex), and it ended up not working.
- Constant scaling: basically multiplying by a certain number, which makes the error scale too!
- The idea of a gate is quite popular and you can arrive at it in many ways:
- Imagine 1 feature map as input and 1 feature map as output, with just a convolution applied in between.
- If you have some object in feature map A (which you obviously want to recognize), it is more important, so we need some feature map to capture the important features of the object.
- The idea of a gate is that you have a side path that does some operations (most likely a convolution) to produce a feature map, then applies a sigmoid element-wise to every location of that side-path feature map, so you get another map where everything is a number between 0 and 1. On the main path you do a regular convolution, and at the end you multiply the 0-1 feature map element-wise with the main-path feature map.
- This is essentially a mask! It is very nice, but in reality this wouldn't work, just because a mask is not smart enough. Why?
- How can you learn the parameters of the side path so it captures the correct features of the object to recognize?
- There is not enough information for you to know this.
- Mask segmentation is a much more difficult task in between.
- You are not told the locations of the object to recognize in the original feature map; this is a difficult task, and most of the time you will make mistakes about the locations you are interested in.
- This is exactly what was done in that 1990 paper -> if you add a gate it won't work
- But there are tons of different gates.
- This is a debate; some people used this long before what we use now.
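A tiny NumPy sketch of the gating idea described above, with random stand-in maps instead of real convolution outputs (all names and shapes here are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
main = rng.standard_normal((4, 4))  # main path (stand-in for its conv output)
side = rng.standard_normal((4, 4))  # side path (stand-in for its conv output)

gate = sigmoid(side)   # element-wise values strictly between 0 and 1
out = gate * main      # soft mask multiplied element-wise into the main path
print(out.shape)       # same shape as the main-path map
```

The gate acts as a learned soft mask: locations where the sigmoid is near 0 suppress the main-path features, and locations near 1 pass them through.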
Identity Mappings in Residual Learning
/CSCE-421/Ex2/Visual%20Aids/image-26.png)