09 - Convolutional Neural Networks (CNN)
Class: CSCE-421
Notes:
Fully Connected Layer
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219105125.png)
Convolution Layer
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219100750.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219100813.png)
Notes:
- For example, if you have 3 input channels, you apply a different filter to each channel and then add the results together
- This technique was already used in the early days!
- More filters = more computation -> your model is slower
- Most of the time we use full connections across channels: each output slice is connected to every input channel. In this example there are 3 channels, but that number is not required.
- 3 input slices -> 3 filters -> 1 output slice
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101103.png)
Notes:
- Remember that a different filter generates a different output
- Each filter yields its own output slice, and all output slices are generated completely independently of one another
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101206.png)
Notes:
- We want to use this convolution to get some features of the input image!
- Things to think about (will be on the exam):
- Given the size of the input and the size of the filter, how do you calculate the size of the output?
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101346.png)
Notes:
- This convolution operation is by far the most important one
- The network will have many layers, but the convolutional layers are by far the most important
- If you want 6 output slices:
- To generate one slice you need a 5x5x3 filter
- To generate 6 you need (5x5x3)x6 weights
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101455.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101509.png)
Notes:
- You want filters that relate to the input image, so they can capture features of the image or certain rotations
- Convolution = elementwise multiplication and sum of a filter and the signal (image)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101615.png)
Notes:
- The image does not have to be square, but in most cases kernels are squares (rectangular filters are not commonly used)
- How do we calculate the output size for a 32x32 input image with a 5x5 filter?
- Output size = 28 = 32 - 5 + 1
- In general the size is determined by:
- size of input - size of filter + 1
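A quick sanity check of this rule (a minimal Python sketch; the helper name is mine, not from the lecture):

```python
def conv_output_size(n, f):
    """Output side length for a stride-1, no-padding convolution: input n, filter f."""
    return n - f + 1

print(conv_output_size(32, 5))  # 28, matching the 32x32 input / 5x5 filter example
```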
A closer look at spatial dimensions:
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102031.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102110.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102154.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102207.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102224.png)
Notes:
- With stride 2, the output is roughly half the size of the input (ignoring rounding at the bottom/right border)
- Remember, so far we had:
- size = size of input - size of filter + 1
- size = 7 - 3 + 1
- That gives 5, which is not what we get with stride 2
- With stride 2 the filter only lands on every other position, so instead you need:
- size = ((input size - filter size) / 2) + 1 = 3
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102455.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102509.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102525.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102544.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102612.png)
Notes:
- This is why border pixels are not treated fairly compared to pixels in the center
- So what do we do about this?
In practice: Common to zero pad the border
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102912.png)
Notes:
- Every time you apply a filter, your output size is reduced by some amount
- This limits the number of convolutional layers you can stack, because each layer receives a smaller slice
- Now what if your input is increased by 1 pixel in each direction?
- This is equivalent to applying a padding of 1
- With the right amount of padding, the convolution does not change the output size
- If you pad and then apply the convolution, you never decrease the size of the slice
- This also helps because it somewhat mitigates the bothersome effect where border pixels were treated unfairly
- In conclusion, the size of your output is determined by:
- The size of the input
- The size of your filter
- The stride
- The padding
- Note that in most cases the stride is 1 unless mentioned explicitly
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103225.png)
Notes:
- With padding on the input, the output after convolution keeps the same size as the unpadded input
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103346.png)
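All four factors combine into the standard formula (n + 2*pad - f) / stride + 1. A minimal sketch of it (the helper name is mine, not from the notes):

```python
def conv_output_size(n, f, stride=1, pad=0):
    """General output size: (n + 2*pad - f) / stride + 1; the filter must fit evenly."""
    assert (n + 2 * pad - f) % stride == 0, "filter does not tile the input evenly"
    return (n + 2 * pad - f) // stride + 1

print(conv_output_size(7, 3, stride=2))         # 3: the stride-2 example from above
print(conv_output_size(32, 5, stride=1, pad=2)) # 32: padding preserves the input size
```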
Convolution: translation-equivariance
- Process each window in the same way
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103443.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103600.png)
- f(x) denotes the convolution
- T(x) denotes a translation (moving the image)
- Equivariance means f(T(x)) = T(f(x)): translating then convolving gives the same result as convolving then translating
- Again, remember this only holds for translation
- The network so far is:
- Translation equivariant
- But not rotation equivariant
- Note that in general filters are not rotated, they are fixed
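Translation equivariance can be checked numerically. A small numpy sketch (the signal and filter values are made up for illustration; the check only holds away from the image borders, which is why the signal is padded with zeros):

```python
import numpy as np

x = np.array([0., 0., 1., 2., 3., 0., 0., 0.])  # signal kept away from the borders
w = np.array([1., 0., -1.])                      # an example 3-tap filter

y_then_shift = np.roll(np.convolve(x, w, mode='valid'), 1)  # T(f(x))
shift_then_y = np.convolve(np.roll(x, 1), w, mode='valid')  # f(T(x))
print(np.allclose(y_then_shift, shift_then_y))  # True: convolution commutes with translation
```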
Convolution = local connection + weight-sharing
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103705.png)
Notes:
- Look at the right model: this is the fully connected layer we talked about in 06 - Multi-Layer Perceptron-Networks
- How are convolutional layers related to this?
- Convolution is a special case of a fully connected layer
- If your kernel size is 3, each output is only connected to 3 inputs; this is a locally connected layer (like the middle model)
- The second thing about convolution is that the parameters are shared: every line of the same color has the same value
- Since these connection weights are trained from data, how can we guarantee the matrix keeps the wiring required for convolution?
- In practice you do not need to worry about this, but in an implementation you initialize the shared parameters to the same value, and in each update you compute the gradient for each copy and average them
- Understanding backward propagation in a convolutional layer is not required in this class
Convolution: linear transform
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219104430.png)
Notes:
- The left diagram is a one-dimensional convolution
- Think of the input as an x vector
- The output is a 4-dimensional y vector
- Values on each diagonal of W have to remain exactly the same
- Note that each output is connected to 3 inputs in this case
- That is equivalent to taking the w vector and computing its product with a window of the x vector
- You still have a fully connected layer with a W matrix, but what is special is that some entries are fixed to 0 and others are forced to share the same value
- Conceptually this view is useful, but it is not what we actually implement; it is not the most efficient way of doing it
- The idea here is inductive bias:
- If you allow W to be a general matrix, the network is not smart enough to discover this structure on its own
- Instead of asking the network to learn locality and weight sharing by itself, we build them into the architecture
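The "fully connected layer with zeros and shared values" view can be written out directly. A minimal numpy sketch (the 6-dim input and 3-tap filter values are assumptions for illustration):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6.])   # input vector
w = np.array([0.5, 1.0, -0.5])           # shared 3-tap filter

# Build the banded W matrix: mostly zeros, with the same 3 weights shifted along each row
n_out = len(x) - len(w) + 1              # 4 outputs, as in the diagram
W = np.zeros((n_out, len(x)))
for i in range(n_out):
    W[i, i:i + len(w)] = w               # weight sharing: identical values along each diagonal

y_matrix = W @ x                                 # the fully connected view
y_conv = np.correlate(x, w, mode='valid')        # sliding elementwise multiply-and-sum
print(np.allclose(y_matrix, y_conv))             # True: same linear transform
```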
Receptive Field
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219105052.png)
Notes:
- Each unit looks at 3 units in the previous layer
- The network is built by stacking many layers
- Each unit on a top layer looks at 3 units in the layer below
- So units higher up can see larger and larger areas of the input (covering a larger span)
- The idea is that:
- If you have an image, apply some convolutions, and get some outputs
- Each unit is looking at some area of the input
- Our goal is to convert this area into a fixed-length vector
- We want to end up with a 1x1 feature map
- That 1x1 map can have, say, 100 channels
- You are then capturing 100 different features of the input
- Each such unit captures some kind of feature of the input image
- Translation Invariance:
- If you have an object and move it, its response also moves after a convolution is applied
- If we keep reducing the size of the feature map with each layer, the final layer is a very small feature map
- On this final layer we apply global pooling: we compute the max over the whole feature map
- After that, the location of the object does not matter anymore
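A tiny numpy sketch of this invariance (the feature map values are made up): global max pooling returns the same value no matter where the feature response sits.

```python
import numpy as np

fmap = np.zeros((6, 6))
fmap[1, 1] = 9.0                              # a strong feature response, top-left

shifted = np.roll(fmap, (3, 3), axis=(0, 1))  # the same feature, moved elsewhere

# Global max pooling discards location: both maps yield the same value
print(fmap.max(), shifted.max())  # 9.0 9.0
```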
Examples time:
Input volume:
![](/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224100140.png)
- Note pad 2 means 2 pixels on each side, so each dimension grows by 4
- Remember each filter is fully connected across the input depth (always keep this in mind)
Output volume size: ?
- 32x32x10
![](/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224100404.png)
- There will be exam questions about this
Number of parameters in this layer?
- Look at [[#Convolution linear transform]]
- To generate one output slice you need a filter of size 5x5x3
- This has nothing to do with stride, padding, input size, or output size
- It only depends on the size of the filter and the number of input feature maps
- To generate one output feature map you need 5x5x3 + 1 parameters
- (+1 for the bias)
- In this case you want 10 output slices, so the number of parameters is (5x5x3 + 1)x10 = 760
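The same bookkeeping as a tiny helper (the function name is mine, not from the slides):

```python
def conv_params(filter_h, filter_w, in_channels, out_channels):
    # each filter spans all input channels, plus one bias per output slice
    return (filter_h * filter_w * in_channels + 1) * out_channels

print(conv_params(5, 5, 3, 10))  # 760, i.e. (5*5*3 + 1) * 10
```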
![](/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224101129.png)
- So what are the important factors for determining the number of parameters?
- (5x5) size of the filter -> larger filter = more parameters
- (x3) number of input feature maps -> every output is connected to every input feature map, so more input maps means larger filters
- (x10) number of output slices -> more output slices = more parameters, because each slice is completely independent
- To reduce the number of parameters you can reduce any of the above
- What is the smallest filter you can use?
- A 1x1 -> basically a single number per input channel
- But a 1x1 filter cannot capture an interesting spatial pattern
- Although it is used in other applications
- In reality, if you want to extract an interesting spatial pattern you need at least a 3x3 filter
1x1 convolution layers make perfect sense
![](/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224101911.png)
Notes:
- How are these filters useful?
- If you apply a 1x1 filter, the spatial size of the output does not change, even without padding
- Because you have 64 input slices, each 1x1 filter actually has size 1x1x64
- You just need to worry about generating one output slice (each one is generated independently in the same manner):
- You multiply each input slice by the corresponding filter number and sum
- Each output slice is a linear combination of the input slices
- This is useful because:
- Say we want to apply 3x3 filters, but the input has 64 slices, which is expensive
- First use 1x1 convolutions to go from 64 slices down to 32: roughly 1x1x64 parameters per output slice
- Then apply the 3x3 filters on top of these 32 slices
- This reduces the total number of parameters compared to applying the 3x3 filters directly on all 64 slices
- 1x1 convolution is used simply to change the number of feature maps
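The "linear combination of input slices" view of a 1x1 convolution is one einsum. A numpy sketch (random data; 64 input slices and 32 output slices as in the example above, spatial size 56x56 is my assumption):

```python
import numpy as np

x = np.random.rand(64, 56, 56)   # 64 input slices of size 56x56
w = np.random.rand(32, 64)       # 32 filters, each of size 1x1x64

# 1x1 convolution: every output slice is a weighted sum of the input slices
y = np.einsum('oc,chw->ohw', w, x)
print(y.shape)  # (32, 56, 56): channel count changed, spatial size untouched
```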
Pooling Layer
- makes the representations smaller and more manageable
- operates over each activation map independently:
![](/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224102723.png)
MAX POOLING
![](/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224102749.png)
Notes:
- If we use a stride of 2, the output size is reduced by a factor of 2
- So the stride is used to reduce the size of the output slices
- This is useful because:
- Pooling has 0 parameters
- What if an object is slightly distorted?
- Max pooling ignores that local distortion
- Pooling operates slice by slice:
- Each slice is pooled completely independently
- The spatial dimensions are reduced by a factor of 2, but the number of slices does not change
- How to do backward propagation in convolutional networks is not required in this class
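A minimal numpy sketch of 2x2 max pooling with stride 2 on a single slice (the example matrix is mine; the reshape trick assumes even dimensions):

```python
import numpy as np

def max_pool_2x2(slice_2d):
    """2x2 max pooling with stride 2 on one slice (even height/width assumed)."""
    h, w = slice_2d.shape
    # Group pixels into non-overlapping 2x2 blocks, then take the max of each block
    return slice_2d.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(max_pool_2x2(a))
# [[6. 8.]
#  [3. 4.]]
```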
Fully Connected Layer
- Contains neurons that connect to the entire input volume, as in ordinary Neural Networks
How to perform BP in convolution, pooling layers?
![](/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224103244.png)
Notes:
- ReLU is just a simple non-linear function; you do not need to worry about it
- ReLU(x) = max(0, x)
- This is useful because we know how to backward propagate through the max function
Optional for CSCE-421
Backpropagation in Convolutional Neural Networks
- Three operations: convolution, max-pooling, and ReLU.
- The ReLU backpropagation is the same as any other network.
- Passes gradient to a previous layer only if the original input value was positive.
- The max-pooling passes the gradient flow through the largest cell in the input volume.
- Main complexity is in backpropagation through convolutions.
Backpropagating through Convolutions
- Traditional backpropagation is transposed matrix multiplication.
- Backpropagation through convolutions is a transposed convolution (i.e., with an inverted filter).
- Derivative of loss with respect to each cell is backpropagated.
- Elementwise approach of computing which cell in input contributes to which cell in output.
- Multiplying with an inverted filter.
- Convert layer-wise derivative to weight-wise derivative and add over shared weights.
Backpropagation with an Inverted Filter (Single Channel)

Filter during convolution:

| a | b | c |
|---|---|---|
| d | e | f |
| g | h | i |

Filter during backpropagation:

| i | h | g |
|---|---|---|
| f | e | d |
| c | b | a |
- Multichannel case: We have 20 filters for 3 input channels (RGB) ⇒ we have 60 spatial slices.
- Each of these 60 spatial slices will be inverted and grouped into 3 sets of filters with depth 20 (one set each for RGB).
- Backpropagate with the newly grouped filters.
Notes:
- Remember convolutions are just a special case of fully connected layers
- Can you convert back to a convolution?
- Yes, backward propagation is still another convolution
- But we need to use different filters (inverted in some way)
- Why is the filter inverted, intuitively?
- In backward propagation you are looking at the network in the backward direction
- In forward propagation your filter moves from left to right, top to bottom
- The first filter entry to touch a given unit is a (a comes first)
- Going backward, you encounter the filter entries in reverse order, so you are effectively looking at the last number of the filter first
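The inverted-filter claim can be checked in one dimension. A numpy sketch (example values are mine): for a forward pass y[i] = sum_m x[i+m]*w[m], the input gradient collects dy[i]*w[j-i], which is exactly a 'full' convolution, and np.convolve flips its filter, i.e. the inverted filter.

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
w = np.array([0.2, 0.5, -0.1])
dy = np.array([1., -1., 2.])              # upstream gradient, len(x) - len(w) + 1 = 3

# Backward pass for x via convolution: np.convolve flips w -> the inverted filter
dx_conv = np.convolve(dy, w, mode='full')

# Check against elementwise bookkeeping: which output each input cell contributes to
dx_loop = np.zeros_like(x)
for i in range(len(dy)):
    for m in range(len(w)):
        dx_loop[i + m] += dy[i] * w[m]    # forward: y[i] uses x[i+m]*w[m]

print(np.allclose(dx_conv, dx_loop))      # True: backprop is conv with the inverted filter
```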
Mini-batch SGD
Loop:
- Sample a batch of data
- Forward prop it through the graph (network), get loss
- Backprop to calculate the gradients
- Update the parameters using the gradient
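The four-step loop above can be sketched end to end on a toy problem (a linear model with a hand-computed gradient, since the point here is the loop, not the network; all data and hyperparameters are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

w = np.zeros(3)
lr, batch = 0.1, 32
for step in range(300):
    idx = rng.integers(0, len(X), size=batch)   # 1. sample a batch of data
    pred = X[idx] @ w                           # 2. forward prop, get the loss's residual
    grad = X[idx].T @ (pred - y[idx]) / batch   # 3. backprop: gradient of mean squared error
    w -= lr * grad                              # 4. update the parameters using the gradient

print(np.round(w, 1))  # recovers approximately [1, -2, 0.5]
```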
Activation Functions
tanh(x)
- Squashes numbers to range [-1, 1]
- Zero centered (nice)
- Still kills gradients when saturated :(
![](/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224104653.png)
ReLU (Rectified Linear Unit)
- Computes f(x) = max(0, x)
- Does not saturate (in + region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- Actually more biologically plausible than sigmoid
- Not zero-centered output
- An annoyance:
- hint: what is the gradient when x < 0?
![](/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224104858.png)
Notes:
- Leaky ReLU is slightly more complicated: it also has a small slope on the negative side
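Both activations in a couple of lines (a minimal numpy sketch; alpha = 0.01 is a common but assumed default for the leaky slope):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)          # hard zero for negative inputs

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small negative slope instead of a hard zero

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))        # [0. 0. 3.]
print(leaky_relu(x))  # [-0.02  0.    3.  ]
```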
TLDR: in practice:
- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU
- Try out tanh but don't expect much
- Don't use sigmoid