09 - Convolutional Neural Networks
Class: CSCE-421
Notes:
Fully Connected Layer
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219105125.png)
Convolution Layer
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219100750.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219100813.png)
Notes:
- For example, if the input has 3 channels, you apply a different filter to each channel and then sum the results (see the sketch after this list)
- This technique was used in the early days!
- More filters = more computation -> your model is slower
- Most of the time we use full connections: each output channel is connected to every input channel. In this example we have 3 channels, but that is not required.
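A minimal sketch of this per-channel filter-and-sum, assuming NumPy and SciPy are available (function name and shapes are illustrative, not from the slides):

```python
import numpy as np
from scipy.signal import correlate2d

def multi_channel_conv(x, filters):
    # x: (C, H, W) input; filters: (C, kH, kW), one filter per input channel.
    # Filter each channel separately, then sum the results into one output slice.
    # (Deep-learning "convolution" is implemented as cross-correlation.)
    return sum(correlate2d(x[c], filters[c], mode="valid")
               for c in range(x.shape[0]))

x = np.random.randn(3, 32, 32)   # 3-channel input image
w = np.random.randn(3, 5, 5)     # one 5x5 filter per channel
print(multi_channel_conv(x, w).shape)  # (28, 28)
```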
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101103.png)
Notes:
- Remember that a different filter generates a different output
- All the output slices are generated independently
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101206.png)
Notes:
- We want to use this convolution to get some features of the input image!
- Things to think about (this will be on the exam):
- If you know the size of the input and the size of the filter, how do you calculate the size of the output?
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101346.png)
Notes:
- The convolution operation is by far the most important one
- The network will have many layers, but the convolutional layers are by far the most important ones
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101455.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101509.png)
Notes:
- You want filters that relate to the input image, so they can account for features of the image or some rotations of those features
- Convolution = elementwise multiplication and sum of a filter and the signal (image)
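As a sketch, here is that multiply-and-sum written out with explicit loops (plain NumPy; strictly speaking this is cross-correlation, which is what deep-learning frameworks call convolution):

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over the image; at each position, multiply
    # elementwise and sum to produce one output pixel.
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + kH, j:j + kW]
            out[i, j] = np.sum(window * kernel)
    return out
```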
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101615.png)
Notes:
- The image does not have to be a square but in most cases kernels are just squares (rectangle filters are not commonly used)
- How to calculate the output image size if we have a 32x32 input image with a 5x5 filter?
- Note: size = 32 - 5 + 1 = 28
- The output size is determined by:
- size of input - size of filter + 1
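A one-line check of this formula (the helper name is illustrative):

```python
def output_size(n, f):
    # n x n input, f x f filter, stride 1, no padding
    return n - f + 1

print(output_size(32, 5))  # 28
```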
A closer look at spatial dimensions:
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102031.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102110.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102154.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102207.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102224.png)
Notes:
- With a stride of 2, the output will be roughly half the size of the input (ignoring the border effect)
- Remember, so far we had:
- size = size of input - size of filter + 1
- size = 7 - 3 + 1
- It will be 5, which is not what the stride-2 output actually is
- So with a stride of 2 you need to do:
- size = ((input size - filter size) / 2) + 1 = ((7 - 3) / 2) + 1 = 3
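Extending the earlier helper with a stride parameter (again a sketch; the divisibility check is my addition):

```python
def output_size(n, f, stride=1):
    # (n - f) must be divisible by the stride for the filter to fit evenly
    assert (n - f) % stride == 0, "filter does not tile the input evenly"
    return (n - f) // stride + 1

print(output_size(7, 3, stride=1))  # 5
print(output_size(7, 3, stride=2))  # 3
```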
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102455.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102509.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102525.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102544.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102612.png)
Notes:
- This is why the pixels on the borders are not treated fairly in comparison to the pixels in the center
- So what have we done about this?
In practice: Common to zero pad the border
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102912.png)
Notes:
- Now your input has been increased by 1 pixel in each direction
- This is equivalent to applying a padding of 1
- With a 3x3 filter, this keeps the output the same size as the original input
- And this will help because it somewhat mitigates the bothersome effect where border pixels were treated unfairly
- In conclusion, the size of your output will be affected by:
- The size of the input
- The size of your filter
- The stride
- The padding
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103225.png)
Notes:
- With the right amount of padding ((filter size - 1) / 2 for stride 1), applying convolution does not change the spatial size: the output matches the original input
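Putting input size, filter size, stride, and padding together in one helper (a sketch using the standard sizing formula, not code from the slides):

```python
def output_size(n, f, stride=1, pad=0):
    # Standard formula: (n - f + 2*pad) / stride + 1
    return (n - f + 2 * pad) // stride + 1

print(output_size(32, 5))                  # 28: no padding shrinks the map
print(output_size(32, 5, pad=2))           # 32: pad = (f - 1) / 2 preserves the size
print(output_size(7, 3, stride=2, pad=1))  # 4
```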
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103346.png)
Convolution: translation-equivariance
- Process each window in the same way
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103443.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103600.png)
- Again, remember this is only true for translation (see the check after this list)
- The network so far is:
- Translation equivariant
- But not rotation equivariant
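A quick numerical check of translation equivariance, using circular boundary conditions so the shift is exact at the edges (my choice for the demo, not from the slides):

```python
import numpy as np
from scipy.signal import correlate2d

x = np.random.randn(8, 8)
k = np.random.randn(3, 3)

conv = lambda img: correlate2d(img, k, mode="same", boundary="wrap")
shift = lambda img: np.roll(img, 2, axis=1)  # translate 2 pixels to the right

# Shifting then convolving equals convolving then shifting.
print(np.allclose(conv(shift(x)), shift(conv(x))))  # True
```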
Convolution = local connection + weight-sharing
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103705.png)
Notes:
- Look at the right model, this is the fully connected layer we have talked about in 06 - Multi-Layer Perceptron-Networks
- How are convolutional layers related to this?
- Convolution is a special case of a fully connected layer
- If your kernel size is 3, each output is connected to only 3 inputs; you can see this is a locally connected layer (like the middle model)
- But the second thing about convolution is that we have shared parameters: each line of the same color has the same value.
- Note that these connection weights are eventually trained from data; how can we guarantee that the matrix keeps the wiring required for convolution?
- In practice you do not need to worry about this, but in an implementation you would initialize the shared parameters to the same value and, at each update, compute the gradient of each copy and average them.
- Not required to understand backward propagation in a convolutional layer in this class.
Convolution: linear transform
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219104430.png)
Notes:
- Think of the input as a vector x
- Your output is a 4-dimensional vector y
- Values along each diagonal of the matrix have to remain exactly the same (that is the weight sharing)
- Note that each output is connected to 3 inputs in this case
- That is equivalent to taking the w vector (the filter) and computing its dot product with a window of the x vector
- You still have a fully connected layer with a W matrix, but what is special about it is that some of the entries are fixed to 0 and some of them share the same value.
- Conceptually this is useful, but in practice this is not really what we implement; it is not the most efficient way of doing it.
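A small sketch of this view for a 1-D convolution (size-6 input, size-3 filter), checking that the constrained W matrix reproduces the sliding-window result; all names are illustrative:

```python
import numpy as np

x = np.random.randn(6)   # input vector
w = np.random.randn(3)   # filter of size 3

# Build the equivalent fully connected matrix W (4 x 6):
# each row holds the same filter w, shifted one position to the right;
# every other entry is fixed to 0.
W = np.zeros((4, 6))
for i in range(4):
    W[i, i:i + 3] = w

direct = np.array([x[i:i + 3] @ w for i in range(4)])  # sliding-window conv
print(np.allclose(W @ x, direct))  # True: convolution is a constrained linear map
```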
Receptive Field
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260219105052.png)
Notes:
- Each unit will look at 3 units in the previous layer
- The network stacks many such layers
- Each unit on a top layer will look at 3 units on the layer below it
- So each unit on top will be able to look at larger and larger areas of the input (covering a larger span)
- The idea is that:
- If you take an image, apply some convolutions, and look at the outputs
- A unit in a deep layer is looking at some area of the input
- Our goal is to convert that area into a vector
- We want the final output to be a one-by-one feature map
- That unit then captures some kind of feature of the input image
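A small helper to compute how the receptive field grows as you stack layers (the standard recurrence; the function is a sketch, not from the slides):

```python
def receptive_field(num_layers, kernel_size=3, stride=1):
    # Each layer adds (kernel_size - 1) * (product of previous strides)
    # pixels to the span of input a single top-layer unit can see.
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

for n in range(1, 5):
    print(n, receptive_field(n))  # 1->3, 2->5, 3->7, 4->9 with 3x3 filters, stride 1
```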