09 - Convolutional Neural Networks (CNN)
Class: CSCE-421
Notes:
Convolutions
Fully Connected Layer
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219105125.png)
Convolution Layer
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219100750.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219100813.png)
Notes:
- For example, if you have 3 input channels, you apply a different filter to each one and then add the results together
- This technique was used in the early days!
- More filters = more calculations -> your model is slower
- Most times we use full connections: each output slice is connected to every input channel; in this example we have 3 channels, but that is not required.
- 3 input slices -> 3 filters -> 1 output slice
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101103.png)
Notes:
- Remember that a different filter generates a different output
- All the output slices are generated completely independently of each other
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101206.png)
Notes:
- We want to use this convolution to get some features of the input image!
- Things to know (will be on the exam):
- If you know the size of the input, if you know the size of the filter, how can you calculate the size of the output?
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101346.png)
Notes:
- This operation of convolution is by far the most important one
- The network will have many layers, but the convolutional layers are by far the most important ones
- If you want 6 output slices:
- To generate one slice you need a 5x5x3 filter
- To generate 6 you need 6 such filters: (5x5x3)x6 weights
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101455.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101509.png)
Notes:
- You want to have filters that may relate to the input image so you can account for features of the image or some rotations
- Convolution = elementwise multiplication and sum of a filter and the signal (image)
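The "elementwise multiplication and sum" definition above can be sketched as a naive valid (no-padding) convolution in NumPy. This is an illustrative sketch, not the lecture's implementation; like most deep learning libraries it slides the filter without flipping it (technically cross-correlation).

```python
import numpy as np

def conv2d_valid(image, kernel):
    # slide the kernel over the image; at each position,
    # elementwise-multiply the window by the kernel and sum
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i+kh, j:j+kw] * kernel).sum()
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((3, 3))
print(conv2d_valid(img, k).shape)  # (2, 2)
```

Note the 4x4 input with a 3x3 filter gives a 2x2 output, matching the size formula derived below.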
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219101615.png)
Notes:
- The image does not have to be square, but in most cases kernels are squares (rectangular filters are not commonly used)
- How do we calculate the output size for a 32x32 input image with a 5x5 filter?
- Output size = 28 = 32 - 5 + 1
- In general, the output size is determined by:
- size of input - size of filter + 1
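The formula above is a one-liner; a quick sketch with the lecture's 32x32 input and 5x5 filter:

```python
def conv_output_size(input_size, filter_size):
    # output size for stride 1, no padding ("valid" convolution)
    return input_size - filter_size + 1

print(conv_output_size(32, 5))  # 28
```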
A closer look at spatial dimensions:
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102031.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102110.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102154.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102207.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102224.png)
Notes:
- Roughly the output will be half the size of the input (ignoring rounding at the border)
- Remember so far we had:
- size = size of input - size of filter + 1
- Size = 7 - 3 + 1 = 5
- But with stride 2 that is not what we get
- So for a general stride you need:
- size = ((size of input - size of filter) / stride) + 1
- Here: (7 - 3)/2 + 1 = 3
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102455.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102509.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102525.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102544.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102612.png)
Notes:
- This is why the pixels on the borders are not treated fairly in comparison to the pixels in the center
- So what have we done about this?
Stride
- The stride is a parameter that determines how many pixels the filter (or kernel) moves across the input image at a time. Think of it as the "step size" of the sliding window as it scans the data.
- When a convolutional layer processes an image, it doesn't just look at the whole thing at once; it slides a small filter over the pixels.
- Stride of 1: The filter moves one pixel at a time. This captures the most detail but results in a larger output.
- Stride of 2: The filter jumps two pixels at a time. This skips every other position, effectively "jumping" across the image.
- Why adjust the stride? It allows you to control two main factors in your model:
- Output Dimensions: A larger stride reduces the spatial dimensions (height and width) of the output volume. This is often used as an alternative to pooling layers to downsample the data.
- Computational Efficiency: Because a larger stride results in fewer total "steps" for the filter to take, it reduces the number of calculations required, speeding up the processing time.
In practice: Common to zero pad the border
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219102912.png)
Notes:
- Every time you apply a filter, your output size is reduced by some amount
- This will limit the number of convolutional layers that you can apply, because each layer will receive a smaller slice
- Now what if your input is increased by 1 pixel in each direction?
- This is equivalent to applying padding
- If you do the padding and then apply convolution, the size of the slice does not decrease
- This also helps because it somewhat mitigates the bothersome effect where border pixels were treated unfairly
- In conclusion, the size of your output is affected by:
- The size of the input
- The size of your filter
- The stride
- The padding
- Note that in most cases stride will be equal to 1 unless mentioned explicitly
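All four factors listed above fit in one formula, (input + 2*padding - filter)/stride + 1, sketched here with the lecture's numbers:

```python
def conv_output_size(n, f, stride=1, pad=0):
    # general formula: (input + 2*padding - filter) / stride + 1
    assert (n + 2 * pad - f) % stride == 0, "filter placements must fit evenly"
    return (n + 2 * pad - f) // stride + 1

print(conv_output_size(32, 5))           # 28: no padding shrinks the slice
print(conv_output_size(32, 5, pad=2))    # 32: padding preserves the size
print(conv_output_size(7, 3, stride=2))  # 3
```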
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103225.png)
Notes:
- With padding on the input, applying the convolution does not change the output size
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103346.png)
Convolution: translation-equivariance
- Process each window in the same way
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103443.png)
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103600.png)
- f(x) means a convolution
- T(x) means a translation (move the image)
- Again remember this is only true for translation
- The network so far is:
- Translation equivariant
- But not rotation equivariant
- Note that in general filters are not rotated, they are fixed
Convolution = local connection + weight-sharing
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219103705.png)
Notes:
- Look at the right model, this is the fully connected layer we have talked about in 06 - Multi-Layer Perceptron-Networks
- How are convolutional layers related to this?
- Convolution is a special case of a fully connected layer
- If your kernel size is 3, the output will only be connected to 3 inputs, you can see this is a locally connected layer (like in the middle model)
- The second thing about convolution is that we have shared parameters: each line of the same color has the same value.
- Eventually these connection weights are trained from data, so how can we guarantee that the matrix keeps the wiring required for convolution?
- In practice you do not need to worry about this, but in an implementation you would initialize the shared parameters to the same value, and in each update compute the gradient of each and average them.
- Not required to understand backward propagation in a convolutional layer in this class.
Convolution: linear transform
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219104430.png)
Notes:
- The left diagram is a one-dimensional convolution representation
- Think about it as an x vector (the input vector)
- Your output is a 4-dimensional y vector
- Values on each diagonal have to remain exactly the same
- Note that each output is connected to 3 inputs in this case
- That is equivalent to taking this w vector and computing the product with the x vector
- You still have a fully connected layer with a W matrix, but what is special about it is that some of the entries are fixed to 0 and some of them share the same value.
- Conceptually this is useful, but in practice this is not really what we implement; it is not the most efficient way of doing it.
- The idea here is inductive bias
- If you allow W to be a general matrix, the network is not smart enough to discover this structure by itself
- So instead of asking the network to learn it, we build the convolutional structure into the model
Receptive Field
/CSCE-421/Visual%20Aids/Pasted%20image%2020260219105052.png)
Notes:
- Each unit will look at 3 units in the previous layer
- The network will be stacked with many layers
- each unit on a top layer will look at 3 units on the bottom layer
- So each unit on top will be able to look at a larger and larger areas of the input (covering a larger span)
- The idea is that:
- If you have an image and you do some convolution and have some outputs
- A unit is looking at some areas of the input
- Our goal is to convert this area into a fixed-length vector
- We want to end up with a 1x1 feature map
- You could have 100 channels at that 1x1 position, i.e. a 1x1x100 feature map
- You are basically capturing 100 different features of the input
- This unit captures some kind of feature of the input image
- Translation Invariance:
- If you have an object and move it, it will also move after a convolution is applied (the convolution itself is equivariant, not invariant)
- If we keep reducing the size of the feature map with each layer, the final layer will be a very small feature map
- This final layer is something called global pooling, we will compute the max over this final feature map
- Now the location does not matter anymore
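The growing receptive field described above can be computed directly: with stride-1 layers, each extra layer of a width-3 filter grows the span by 2. A minimal sketch:

```python
def receptive_field(num_layers, kernel=3):
    # with stride-1 layers, each layer adds (kernel - 1) to the span
    # of the input that one top unit can see
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1
    return rf

print([receptive_field(k) for k in (1, 2, 3)])  # [3, 5, 7]
```

So stacking two 3-wide layers covers the same span as a single 5-wide filter.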
Examples time:
Input volume:
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224100140.png)
- Note pad 2 means 2 pixels on each side, so the size grows by 4 in each dimension
- Remember each filter is fully connected across the input channels (always keep this in mind)
Output volume size: ?
- 32x32x10
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224100404.png)
- There will be exam questions about this
Number of parameters in this layer?
- Look at #Convolution linear transform
- To generate one output slice you need a filter of size 5x5x3
- This has nothing to do with stride, pad, size of input, or size of output
- It is only related to the size of the filter and the number of input feature maps
- To generate one output feature map you need 5x5x3 + 1 parameters (+1 for the bias)
- In this case you want to generate 10 slices, so the number of parameters will be (5x5x3 + 1)x10 = 760
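The parameter count above generalizes to any conv layer; a sketch using the example's numbers:

```python
def conv_layer_params(fh, fw, in_maps, out_maps):
    # one (fh x fw x in_maps) filter plus one bias per output slice
    return (fh * fw * in_maps + 1) * out_maps

print(conv_layer_params(5, 5, 3, 10))  # 760
```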
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224101129.png)
- So what is an important factor for determining the number of parameters?
- (5x5) size of filter -> larger filter = more parameters
- (x3) number of input feature maps -> every output is connected to every input feature map
- (x10) number of output slices -> more output slices = more parameters, because each slice is completely independent
- To reduce the number of parameters you can reduce any of the above
- What is the smallest filter you can use?
- A 1x1 -> basically a single number
- But this will not capture an interesting spatial pattern
- Although it is used in other applications
- In reality, if you want to extract an interesting pattern you need at least a 3x3 filter.
1x1 convolution layers make perfect sense
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224101911.png)
- "it is just used to reduce the number of feature maps -> and therefore reduce the number of parameters"
Notes:
- How are these filters useful?
- If you apply a 1x1 filter, the output size will not change even without padding
- Because you have 64 input slices, each of these 1x1 filters actually has depth 64 (1x1x64)
- You just need to worry about generating one slice (each one is generated independently in the same manner):
- You multiply each of the slices by the filter number
- Each output slice will be a linear combination of the input slices
- This is useful because:
- Say we have 64 input slices and want to apply 3x3 filters to get 32 output slices
- Instead, first use 1x1 filters to reduce the 64 slices to fewer feature maps (roughly 1x1x64 parameters per output slice)
- Then apply the 3x3 filters on top of this reduced stack
- This way the total number of parameters is smaller compared to applying the 3x3 filters directly on all 64 slices
- 1x1 convolution is just used to change the number of feature maps
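The savings can be checked with the parameter formula. A sketch with illustrative channel counts (64 input maps, 32 output maps, biases ignored for simplicity):

```python
def conv_weights(k, cin, cout):
    # weight count of a k x k convolution, ignoring biases
    return k * k * cin * cout

direct = conv_weights(3, 64, 32)                               # 3x3 straight on 64 maps
bottleneck = conv_weights(1, 64, 32) + conv_weights(3, 32, 32)  # 1x1 reduce, then 3x3
print(direct, bottleneck)  # 18432 11264
```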
Pooling Layer
- makes the representations smaller and more manageable
- operates over each activation map independently:
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224102723.png)
MAX POOLING
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224102749.png)
Notes:
- If we use a stride of 2, the size of the output is reduced by a factor of 2
- So stride is used to reduce the output slice size
- This is useful because:
- There are 0 parameters in a pooling layer
- What if an object is slightly distorted? If you do max pooling, that local distortion is actually ignored
- Pooling is applied slice by slice: each slice is processed completely independently
- The output dimension is reduced by a factor of 2, but the number of slices does not change
- How to do backward propagation on convolutional networks is not required in this class
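Max pooling with a 2x2 window and stride 2 can be sketched on a single slice; note there are no parameters, just a max over each window:

```python
import numpy as np

def max_pool_2x2(x):
    # 2x2 max pooling with stride 2 on one slice; no parameters involved
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool_2x2(a))
# [[6 8]
#  [3 4]]
```

In a real network this runs on each slice of the volume independently, as the notes say.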
Fully Connected Layer
- Contains neurons that connect to the entire input volume, as in ordinary Neural Networks
How to perform BP in convolution, pooling layers?
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224103244.png)
Notes:
- ReLU is just a simple nonlinear function, you do not need to worry about it
- ReLU(x) = max(0, x)
- This is useful because we can backward propagate through the max function
Optional for CSCE-421
Backpropagation in Convolutional Neural Networks
- Three operations of convolutions, max-pooling, and ReLU.
- The ReLU backpropagation is the same as any other network.
- Passes gradient to a previous layer only if the original input value was positive.
- The max-pooling passes the gradient flow through the largest cell in the input volume.
- Main complexity is in backpropagation through convolutions.
Backpropagating through Convolutions
- Traditional backpropagation is transposed matrix multiplication.
- Backpropagation through convolutions is transposed convolution (i.e., with an inverted filter).
- Derivative of loss with respect to each cell is backpropagated.
- Elementwise approach of computing which cell in input contributes to which cell in output.
- Multiplying with an inverted filter.
- Convert layer-wise derivative to weight-wise derivative and add over shared weights.
Backpropagation with an Inverted Filter (Single Channel)
| a | b | c |
|---|---|---|
| d | e | f |
| g | h | i |
| Filter during convolution |
| i | h | g |
|---|---|---|
| f | e | d |
| c | b | a |
| Filter during backpropagation |
- Multichannel case: We have 20 filters for 3 input channels (RGB) ⇒ We have 60 spatial slices.
- Each of these 60 spatial slices will be inverted and grouped into 3 sets of filters with depth 20 (one each for RGB).
- Backpropagate with the newly grouped filters.
Notes:
- Remember convolutions are just a special case of fully connected layers
- Can you convert back to a convolution?
- Yes, backward propagation is still another convolution
- But we need to use different filters (inverted in some way)
- Why is the filter inverted intuitively?
- In backward propagation you are looking at the network in a backward manner
- In forward propagation your filter moves from left to right, from top to bottom
- The first filter entry to touch a given unit is a (a comes first)
- So going backward, you are basically looking at the last number in the filter first
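The "inverted filter" in the tables above is just the forward filter flipped along both axes, which is a one-liner:

```python
import numpy as np

# the filter used during backprop is the forward filter
# reversed in both spatial dimensions
f = np.array([["a", "b", "c"],
              ["d", "e", "f"],
              ["g", "h", "i"]])
print(f[::-1, ::-1])
# [['i' 'h' 'g']
#  ['f' 'e' 'd']
#  ['c' 'b' 'a']]
```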
Mini-batch SGD
Loop:
- Sample a batch of data
- Forward prop it through the graph (network), get loss
- Backprop to calculate the gradients
- Update the parameters using the gradient
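The four-step loop above can be sketched end to end on a toy problem. This is an illustrative NumPy example (made-up data, linear model instead of a network) just to show the sample/forward/backprop/update cycle:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 3))        # toy dataset
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                           # noiseless labels for the sketch

w = np.zeros(3)
lr, batch_size = 0.1, 32
for step in range(200):
    idx = rng.integers(0, len(X), batch_size)   # 1. sample a batch of data
    xb, yb = X[idx], y[idx]
    pred = xb @ w                               # 2. forward prop, loss = mean (pred - y)^2
    grad = 2 * xb.T @ (pred - yb) / batch_size  # 3. backprop: gradient of the loss
    w -= lr * grad                              # 4. update parameters using the gradient
print(w.round(2))  # close to [ 1.  -2.   0.5]
```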
Activation Functions
tanh(x)
- Squashes numbers to range [-1, 1]
- zero centered (nice)
- still kills gradients when saturated :(
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224104653.png)
ReLU (Rectified Linear Unit)
- Computes f(x) = max(0, x)
- Does not saturate (in + region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- Actually more biologically plausible than sigmoid
- Not zero-centered output
- An annoyance:
- hint: what is the gradient when x < 0?
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260224104858.png)
Notes:
- Leaky ReLU is more complicated, basically you will have also a small negative slope as well
TLDR: in practice:
- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU
- Try out tanh but don't expect much
- Don't use sigmoid
Data Preprocessing
Learning Rate in Gradient Descent
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226093723.png)
Notes:
- With a smaller step size you will take more steps
- With a larger step size you will take fewer steps
Normalization
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226093809.png)
- Same learning rate applied to all weights
- Large weights dominate updates
- Small weights oscillate (or diverge)
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226093843.png)
- Similar pace for all weights
Notes:
- You might need to use different learning rates for different weights
- We need to do some data normalization to make sure our data is roughly on the same scale
Data normalization in machine learning
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226093919.png)
Notes:
- How do we normalize training data?
- Each row in the above example is one sample x
- You compute the mean and the standard deviation for each of the features
- You want the mean of each feature to be 0 and the standard deviation to be small
- In summary: subtract the mean and divide (element-wise) by the standard deviation
- How do we normalize test data?
- You still need a mean and standard deviation, right? Wouldn't it be the same as for the training data?
- Yes, you want your data to be shifted and scaled by the same numbers for both training and test
- Problem if you instead compute statistics from the test batch:
- Predictions for each sample can depend on other samples
- This means your prediction will change depending on which other samples are in the same batch
- You can get different predictions from the same model
- This is a very common mistake, and it is a key idea of how we do data normalization
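The train/test rule above in a minimal sketch (made-up numbers): statistics come from the training set only, and the test set reuses them.

```python
import numpy as np

# toy training set: rows are samples, columns are features on very different scales
X_train = np.array([[1.0, 200.0],
                    [2.0, 400.0],
                    [3.0, 600.0]])
mu = X_train.mean(axis=0)      # per-feature mean
sigma = X_train.std(axis=0)    # per-feature standard deviation
X_train_norm = (X_train - mu) / sigma

# test data MUST reuse the training statistics, never its own batch statistics,
# otherwise a prediction depends on which other samples happen to be in the batch
X_test = np.array([[2.0, 500.0]])
X_test_norm = (X_test - mu) / sigma
print(X_train_norm.mean(axis=0))  # ~[0. 0.]
```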
TLDR: In practice for images: center only
- e.g. consider the CIFAR-10 example with [32,32,3] images
- Subtract the mean image (e.g. AlexNet)
- (mean image = [32,32,3] array)
- Subtract per-channel mean (e.g. VGGNet)
- (mean along each channel = 3 numbers)
- Not common to normalize variance, or to do PCA or whitening
Normalization Modules
- We want to maintain variance for all layers
- normalize features in the network
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226094159.png)
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226094247.png)
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226095826.png)
Notes:
- You have a network of many layers and you need to insert some normalization layers between them to make sure you are normalizing your data before it is input to the next layer.
Batch Normalization
ONE OF THE MOST IMPORTANT TOPICS IN DEEP LEARNING
- Allows us to actually build networks of many layers
- There will be a question about batch normalization in the final exam
Batch Normalization
"you want zero-mean unit-variance activations? just make them so."
consider a batch of activations at some layer. To make each dimension zero-mean unit-variance, apply:
$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$
this is a vanilla differentiable function...
Notes:
- When we are training deep learning models, we do mini-batch SGD
- You sample a batch of data
- Use batch to train your network
- calculate gradients
- Update parameters
- You divide your data into mini-batches and do forward propagation, backward propagation, and the parameter update for one batch before moving to the next batch
- One pass over the whole set of batches is called an epoch
"you want zero-mean unit-variance activations? just make them so."
- compute the empirical mean and variance independently for each dimension.
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226100417.png)
- Fully connected layer with D units
- For each sample you will get a D-dimensional vector
- Get the mean, get the standard deviation, then subtract the mean and divide by the standard deviation element-wise
- Normalize
Notes:
- This is in training
- You normalize for each of the mini-batches
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226100614.png)
- BN = Batch Normalization
- FC = Fully Connected
Notes:
- Conceptually this normalization layer can be inserted anywhere, because each of the layers receives input, does some process and returns some output.
- You usually insert after fully connected or convolutional layers, and before nonlinearity
Batch Normalization (Math)
Normalize:
$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$
And then allow the network to squash the range if it wants to:
$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$
Note, the network can learn:
$$\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}, \quad \beta^{(k)} = \mathrm{E}[x^{(k)}]$$
to recover the identity mapping.
Notes:
- A paper proposed doing something slightly different
- The normalization operation is purely computed of your data
- But with this change these two parameters become learnable as well
- The key idea is that they try to introduce some learnable parameter in this layer
- They try to do some data dependent adaptation (learning) and adjust normalization in that way since mean and standard deviation are not perfect.
- We can then use this to essentially un-do the normalization
- Each of these parameters is a vector of the same dimension as the mean and standard deviation vectors
- Remember that all of this is what happens in training, not in test time
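The training-time computation above can be sketched in NumPy (illustrative shapes, eps is the usual small constant for numerical stability):

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    # x: (N, D) mini-batch; gamma, beta: (D,) learnable scale and shift
    mu = x.mean(axis=0)                  # per-dimension batch mean
    var = x.var(axis=0)                  # per-dimension batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta          # gamma/beta let the net undo this if useful

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4)) * 5.0 + 3.0   # far from zero-mean unit-variance
out = batchnorm_train(x, np.ones(4), np.zeros(4))
print(out.mean(axis=0).round(6))  # ~[0. 0. 0. 0.]
```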
Batch Normalization (Algorithm)
- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe
Note: at test time BatchNorm layer functions differently:
- The mean/std are not computed based on the batch. Instead, a single fixed empirical mean of activations during training is used.
- (e.g. can be estimated during training with running averages)
Notes:
- During training you keep the learned parameters and the running statistics for batch normalization, and at test time you use these
- Essentially you are using your training statistics at test time
- Basically you keep a running average of the mean and standard deviation during training and use that at test time
- "Maintain a global mean and global standard deviation in training"
- PyTorch already handles this for you
- Consequence of using mini-batches:
- Suppose you are training a network with many layers
- If you do not use batch normalization, in training you can feed one sample at a time; for a batch of 10 samples you simply don't update until you have processed all 10
- If we add batch normalization between layers, can we still do that?
- No, because for each sample we need all the other samples in the batch in order to normalize
- Basically you need to give all 10 samples to the network at once
- The consequence is that:
- Between the layers you need to keep the output of each layer in memory
- So each layer has some impact on memory
- If you use a very large mini-batch size, the activations for each layer become larger -> takes more memory
- So your mini-batch size is limited by your memory
- In theory a larger mini-batch size is better, but you are limited by your GPU memory
- Usually not more than 10 GB
- High-end GPUs use 18 GB
- So this is a very important point since we do not have huge memory sizes
- But larger mini-batches -> more data per gradient estimate -> more accurate gradients
- QUESTION ON EXAM ABOUT LIMITATION OF MINI-BATCH
...
Batch Normalization: Test Time
Input: x (shape N x D)
Learnable params: γ, β (shape D)
Intermediates: at test time, μ and σ² are fixed to the running averages kept during training
Output: y = γ (x − μ) / √(σ² + ε) + β
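The test-time behavior above, sketched with illustrative running statistics: note no batch statistics are computed, so each sample's output is deterministic.

```python
import numpy as np

def batchnorm_test(x, gamma, beta, running_mu, running_var, eps=1e-5):
    # test time: use the fixed running statistics from training, not batch
    # statistics, so predictions are independent of the other samples
    return gamma * (x - running_mu) / np.sqrt(running_var + eps) + beta

x = np.array([[1.0, 10.0]])
out = batchnorm_test(x, gamma=np.ones(2), beta=np.zeros(2),
                     running_mu=np.array([1.0, 10.0]),
                     running_var=np.array([4.0, 9.0]))
print(out)  # ~[[0. 0.]]
```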
Batch Normalization for ConvNets
Batch Normalization for fully-connected networks
Batch Normalization for convolutional networks (Spatial Batchnorm, BatchNorm2D)
Notes:
- That x vector is essentially why we need to understand mini-batch limitation, it gets bigger with your mini-batch size (N)
Layer Normalization
Batch Normalization for fully-connected networks
- The mean and standard deviation are D-dimensional
- Here your normalization averages across N (the batch dimension)
Layer Normalization for fully-connected networks Same behavior at train and test! Can be used in recurrent networks
- Your mean and standard deviation have the dimension of the batch size (N)
- Here your normalization averages across D (the feature dimension)
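The axis difference between the two is just which dimension you average over; a sketch:

```python
import numpy as np

x = np.arange(12, dtype=float).reshape(3, 4)  # (N=3 samples, D=4 features)
bn_mu = x.mean(axis=0)   # batch norm: average over the batch N -> shape (4,) = D
ln_mu = x.mean(axis=1)   # layer norm: average over the features D -> shape (3,) = N
print(bn_mu.shape, ln_mu.shape)  # (4,) (3,)
```

Because layer norm needs no other samples, it behaves the same at train and test time.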
Notes:
- Example: sentences
- If you have 10 sentences in the same mini-batch and each sentence can have a different length, how do you deal with that when you need to normalize?
- You can use a sentence-length threshold, like 30 words, and pad or truncate shorter or longer sentences
- Fun fact: graphics use layer normalization
Early Stopping
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226103515.png)
- Stop training the model when accuracy on the validation set decreases
- Or train for a long time, but always keep track of the model snapshot that worked best on val
Notes:
- If you train a logistic regression you do not want to stop early
- Once you move to networks, there is a hidden layer
- You cannot just follow the gradient to convergence, because it will only give you a local minimum
- You still need a validation data set to help you decide when to stop
- This is because training accuracy will keep increasing, but the model will eventually overfit
- You need to find the point where overfitting starts, and you can detect this point with a validation data set
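The "keep the best snapshot, stop when validation stops improving" idea can be sketched on a made-up validation-accuracy curve (all numbers and the patience value are illustrative):

```python
# toy validation-accuracy curve: rises, peaks, then degrades (overfitting)
val_accs = [0.60, 0.70, 0.75, 0.78, 0.77, 0.76, 0.74, 0.73]

best_acc, best_epoch, patience, bad = 0.0, -1, 3, 0
for epoch, acc in enumerate(val_accs):
    if acc > best_acc:
        best_acc, best_epoch, bad = acc, epoch, 0  # keep the best model snapshot
    else:
        bad += 1
        if bad >= patience:
            break  # stop: no improvement for `patience` epochs in a row
print(best_epoch, best_acc)  # 3 0.78
```

In a real run, "keep the snapshot" means saving the model weights at the best epoch.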
Regularization: Dropout
In each forward pass, randomly set some neurons to zero Probability of dropping is a hyperparameter; 0.5 is common
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226103830.png)
Notes:
- This is a very useful technique
- This only works on fully connected layers, if you have a network, there must be at least one fully connected layer and this will apply only to that layer
- Drop out:
- It does something different in training and test
- In training time we basically remove units, removing means removing input and output connection of a unit.
- During forward and backward propagation you remove nodes
- This is special for each sample
- If your dropout probability is 0.2, you have 0.2 probability of removing each of these nodes
How can this possibly be a good idea?
Forces the network to have a redundant representation; Prevents co-adaptation of features
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226104312.png)
Notes:
- Forces the network to create some kind of redundancy in its features, so that if some of them are removed you can still recognize the object
- The idea is that even with part of the representation removed, you still want to make accurate predictions
Another Interpretation
- Dropout is training a large ensemble of models (that share parameters).
- Building multiple different models around a single set of data
- Here comes the idea of a Random Forest
- Each binary mask is one model
/CSCE-421/Ex2/Visual%20Aids/Pasted%20image%2020260226104505.png)
- How many possible ways are there to drop out an entire layer if each unit has two possibilities (dropped out or not)?
- An FC layer with 4096 units has 2^4096 possible masks!
- This is a huge number! -> so many different models!
- Only ~10^82 atoms in the universe...
Notes:
- Each time you see a different dropout pattern, you are looking at a different network with the same training scheme
- You train multiple models with the same training set, training each of them costs a lot but this is a way to deal with that
- This is the most well known way to simulate ensemble in deep learning models
- Example:
- Random Forest:
- You need some kind of randomization that can come from data, you then build many different models
- In deep learning this is not used because it is not practical to train different models
- Random Forest:
- remember that training a single neural network is very expensive
- For a network like the above each node has some possibility (drop out, or not)
- Each mask counts as a different network; it is random and specific to each sample
- The mask may also differ from one mini-batch to the next
Dropout: Test time
Dropout makes our output random!
/CSCE-421/Ex2/Visual%20Aids/image.png)
Want to "average out" the randomness at test-time
But this integral seems hard ...
Want to approximate the integral
Consider a single neuron with inputs x, y and weights w1, w2.
At test time we have: E[a] = w1·x + w2·y
During training (with drop probability 0.5) we have:
E[a] = 1/4 (w1·x + w2·y) + 1/4 (w1·x) + 1/4 (w2·y) + 1/4 (0) = 1/2 (w1·x + w2·y)
At test time, multiply by the keep probability (1 − dropout probability)
- This scaling factor is determined by the dropout probability
/CSCE-421/Ex2/Visual%20Aids/image-1.png)
Notes:
- In training we have some randomness, but in test we have no randomness. Why?
- In prediction everything needs to be deterministic (no randomness)
- In training each sample uses a different mask; conceptually, every possible mask gives you a network, and you would average the predictions of all of them
- In prediction time you do not drop any nodes; you use the full network and multiply the activations by the keep probability
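The train/test asymmetry above in a minimal sketch (the 0.5 probability is the common default mentioned earlier): random masks in training, a deterministic scale by the keep probability at test time.

```python
import numpy as np

p_drop = 0.5  # dropout probability (hyperparameter)
rng = np.random.default_rng(0)

def dropout_train(x):
    mask = rng.random(x.shape) >= p_drop  # drop each unit independently
    return x * mask                       # a fresh random mask per call

def dropout_test(x):
    return x * (1.0 - p_drop)  # no dropping: scale by the keep probability

x = np.ones(100_000)
print(dropout_train(x).mean())  # ~0.5 on average (random)
print(dropout_test(x).mean())   # exactly 0.5 (deterministic)
```

The test-time scaling makes the expected activation match the training-time average, which is the "average out the randomness" approximation.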
Regularization: A common pattern
Training: Add some kind of randomness
- Example: Batch Normalization
- Training: Normalize using stats from random minibatches
Testing: Average out randomness (sometimes approximate)
- Testing: Use fixed stats to normalize
Regularization: Data Augmentation
/CSCE-421/Ex2/Visual%20Aids/image-2.png)
Notes:
- Another way to encourage invariance is to do data augmentation
- We want the network to be invariant to image translation and rotation
- Operations such as flip, rotate, or slightly translate the image
- If we randomly rotate an image by 90 degrees and tell the network this is a cat, then give it a 180-degree rotated image and keep telling the network this is a cat, and so on
- We are teaching the network by examples that rotated cats are also cats
- But the network might still make different predictions
- During training time we try to make our model invariant, but during prediction time this is no longer the case. Why?
/CSCE-421/Ex2/Visual%20Aids/image-3.png)
Random crops and scales
Training: sample random crops / scales
ResNet:
- Pick random L in range [256, 480]
- Resize training image, short side = L
- Sample random 224x224 patch
/CSCE-421/Ex2/Visual%20Aids/image-4.png)
Testing: average a fixed set of crops
ResNet:
- Resize image at 5 scales: {224, 256, 384, 480, 640}
- For each size, use 10 crops: 4 corners + center, + flips
Regularization: A common pattern (summary)
Training: Add random noise
Testing: Marginalize over the noise
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
/CSCE-421/Ex2/Visual%20Aids/image-5.png)
Transfer Learning
"You need a lot of a data if you want to train/use CNNs"
Transfer Learning with CNNs
/CSCE-421/Ex2/Visual%20Aids/image-6.png)
Notes:
- Lets say you only have 2000 images and you want to teach a network with this set
- The idea is that in many cases, if you train a network like the first one, the filters learned in the lower layers are very similar even across different data sets; since most objects have edges, these first filters basically act as edge detectors
- If you have a very small data set, you can take that pretrained network and fine-tune it on your images
- You keep many of the early layers fixed and mostly retrain some of the later layers
/CSCE-421/Ex2/Visual%20Aids/image-7.png)
CNN Architectures
Review: LeNet-5
This network will be similar to what will be on the final
/CSCE-421/Ex2/Visual%20Aids/image-8.png)
Notes:
- Network:
- A convolutional layer
- Pooling layer
- Convolutional layer
- Pooling layer
- Fully connected layer
- This is what you will need to implement in the next homework
- This is an idea from the 1990s; at that time there was no GPU training, so this was about the best you could do on a CPU
- It is such a simple network and therefore not very effective
- LUSH is an open-source software package you can use to implement these kinds of networks
- The L stands for LISP
- LISP stands for LISt Processing; it is a very different language
- Coding in it is very counterintuitive
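The shape bookkeeping for a conv/pool stack like this (and the output-size question flagged for the exam) can be checked in a few lines of Python. The 32x32 input and 5x5 conv / 2x2 pool sizes are the classic LeNet-5 numbers, assumed here for illustration:

```python
def conv_out(w, f, s=1, p=0):
    """Spatial output size of a conv or pool layer: (W - F + 2P) // S + 1."""
    return (w - f + 2 * p) // s + 1

# LeNet-5-style walkthrough (assumed: 32x32 input, no padding)
w = 32
w = conv_out(w, 5)        # conv 5x5           -> 28
w = conv_out(w, 2, s=2)   # pool 2x2, stride 2 -> 14
w = conv_out(w, 5)        # conv 5x5           -> 10
w = conv_out(w, 2, s=2)   # pool 2x2, stride 2 -> 5
print(w)  # 5; the 5x5 maps are then flattened into the fully connected layer
```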
Case Study: AlexNet
/CSCE-421/Ex2/Visual%20Aids/image-9.png)
Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%
Notes:
- The only thing that is new in this paper is the use of ReLU
- He used C++ and implemented convolutional networks on GPU for the first time
- The code is very messy
- This is now a much bigger network
- You have a convolutional layer
- Max pool layer
- Normalization (not batch normalization)
- and so on
- The key point of this paper is that they were able to write this code for the GPU, so they were able to train a larger model
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
/CSCE-421/Ex2/Visual%20Aids/image-10.png)
Notes:
- All of the earlier winners were not deep learning
- This was essentially the first demonstration that deep learning can outperform other models!
- Before this, no one thought deep learning would work
- This tells us that sometimes when you try to develop something (even with a great idea), it is the implementation that will challenge you
Case Study: VGGNet
Small filters, deeper networks
8 layers (AlexNet) -> 16-19 layers (VGG16Net)
Only 3x3 conv (stride 1, pad 1) and 2x2 max pool (stride 2)
11.7% top 5 error in ILSVRC'13 (ZFNet) -> 7.3% top 5 error in ILSVRC'14
/CSCE-421/Ex2/Visual%20Aids/image-11.png)
Notes:
- The only difference between the variants is how many layers they have
- In these networks they only use 3x3 filters, stacked directly on top of each other
- There is no pooling or other layer in between them
- They do this one after the other multiple times because:
- If you want a larger receptive field, you just stack one more convolutional layer
- What has been shown is that this is better in terms of the representable function
- They are nested: each layer is nested on top of the other
- In total, the set of functions that can be represented is larger
- You can use a smaller number of parameters to represent a much more complex function
- You do not need to understand this network in full; you only need the idea that this is better than applying one convolution layer with a larger filter
- The only thing you need to understand here is that in these networks we only use 3x3 convolutions directly stacked on top of each other
Q: Why use smaller filters? (3x3 conv)
A stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer
But deeper, more non-linearities
And fewer parameters: 3 * (3^2 * C^2) vs. 7^2 * C^2 for C channels per layer
Notes:
- Think about the number of parameters
- Let's say you have an input feature map
- Say it has c channels, and you want to generate c output feature maps
- For the 7x7 conv layer, how many parameters would there be? (let's not count the bias)
- This will be 7 * 7 * c * c
- That is 49c^2
- For a 3x3 kernel:
- 3 * 3 * c * c
- That is 9c^2
- If you apply this 3 times you get:
- 3 * 9 * c^2
- That is 27c^2
- That is why we still use this (this idea is still common)
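The 49c^2 vs. 27c^2 arithmetic above can be verified directly. `conv_params` is a hypothetical helper, and `c = 64` is an arbitrary choice; the ratio is the same for any c:

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k conv layer mapping c_in -> c_out maps (bias ignored)."""
    return k * k * c_in * c_out

c = 64                                  # assumed channel count
one_7x7 = conv_params(7, c, c)          # 49 * c^2 = 200704
three_3x3 = 3 * conv_params(3, c, c)    # 27 * c^2 = 110592
print(one_7x7, three_3x3)               # the stacked 3x3 layers win
```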
Example:
/CSCE-421/Ex2/Visual%20Aids/image-12.png)
- The blue part tells you the number of parameters of each layer
- What can you see from these numbers?
- You can see memory keeps reducing: as you go from layer to layer, the feature map size is reduced over the network
- Note that as the memory reduces, the number of parameters increases!
- What determines the number of parameters here is the number of feature maps
- When you reach the end it becomes a 4096-dimensional vector
- That is also the reason we use only one fully connected layer
/CSCE-421/Ex2/Visual%20Aids/image-13.png)
Case Study: GoogLeNet
/CSCE-421/Ex2/Visual%20Aids/image-14.png)
Apply parallel filter operations on the input from the previous layer:
- Multiple receptive field sizes for convolution (1x1, 3x3, 5x5)
- Pooling operation (3x3)
Concatenate all filter outputs together depth-wise
Q: What is the problem with this? [Hint: Computational complexity]
Notes:
- This network itself is no longer useful, but there is one important idea in it
- Normally what we would do is apply some convolution to generate the output
- What they wanted was to use filters of different sizes in each layer (this motivation is no longer used, but it was their initial one)
- You cannot easily pick a single filter size per layer (implementation-wise it is difficult to handle), so they apply all sizes in parallel
- Before this, within each layer the size of the feature map was always the same
- This introduces many parameters!
- Because you will generate a large number of feature maps
- The number of parameters becomes too large
/CSCE-421/Ex2/Visual%20Aids/image-15.png)
- What they tried to do is use a 1x1 conv as a "bottleneck"
- This is a still-useful technique!
- They want to reduce the number of feature maps
- They use this 1x1 convolution mainly to reduce the number of feature maps
- Then the number of parameters is reduced, because the layer between a 1x1 and a 3x3 convolution involves fewer feature maps
- Doesn't this also introduce some parameters?
- Yes, but the key thing is that the 1x1 filter is very small, and it reduces the number of feature maps going into the 3x3 convolution
- If you calculate the number of parameters, the total is smaller than if you applied the 3x3 or 5x5 convolution directly to the outputs of the previous layer
- First use 1x1 convolutions to reduce the number of feature maps, then apply the larger convolutions
- That is why this is called a bottleneck: you receive a large number of feature maps from the previous layer, use a 1x1 convolution to reduce that number, and then apply convolutions to expand the number of feature maps delivered to the next layer
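A quick sketch of why the 1x1 reduction saves parameters; the channel counts (256 input maps, 64 output maps) are assumed purely for illustration:

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k conv layer mapping c_in -> c_out maps (bias ignored)."""
    return k * k * c_in * c_out

# Direct 5x5 conv on all 256 input maps vs. reducing to 64 maps with a 1x1 first
direct = conv_params(5, 256, 64)                             # 409600
reduced = conv_params(1, 256, 64) + conv_params(5, 64, 64)   # 16384 + 102400
print(direct, reduced)  # the bottleneck path uses far fewer weights
```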
Stack Inception modules with dimension reduction on top of each other
/CSCE-421/Ex2/Visual%20Aids/image-16.png)
Case Study: ResNet
(Look at slides for more detail)
What happens when we continue stacking deeper layers on a "plain" convolutional neural network?
/CSCE-421/Ex2/Visual%20Aids/image-17.png)
56-layer model performs worse on both training and test error
-> The deeper model performs worse, but it's not caused by overfitting!
Notes:
- Note the 20-layer network is a subset of the 56-layer network
- In theory the 56-layer network has larger capacity than the 20-layer one, yet the 56-layer network has higher training error
- Same thing for the test error
Hypothesis: the problem is an optimization problem, deeper models are harder to optimize
The deeper model should be able to perform at least as well as the shallower model.
A solution by construction is copying the learned layers from the shallower model and setting additional layers to identity mapping.
Solution: Use network layers to fit a residual mapping instead of directly trying to fit a desired underlying mapping
/CSCE-421/Ex2/Visual%20Aids/image-19.png)
- Use layers to fit the residual F(x) = H(x) - x instead of H(x) directly
Notes:
- The idea is that sometimes doing nothing is actually better!
- You can see that in this network the only thing we did is copy the input feature maps to the output
- On one path the input feature map goes through the convolutions, and on the other path it is passed directly to the output, where the two are summed together
- What does this mean?
- Hard requirement:
- The two paths have to produce exactly the same number of feature maps, of the same size
- This whole block in general does not change the number or size of the feature maps
- Takeaway: the number and size of the feature maps at the input will be the same as the number and size of the feature maps at the output
- "Does not change the number of output feature maps"
- This is the straightforward case; there are more complex cases
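A minimal NumPy sketch of a residual block on a single-channel map (BN omitted, weights random; a toy illustration of the shape requirement, not the actual ResNet code):

```python
import numpy as np

def conv3x3(x, w):
    """'Same' 3x3 convolution on one single-channel map (pad=1, stride=1)."""
    h, ww = x.shape
    xp = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    for i in range(h):
        for j in range(ww):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * w)
    return out

def residual_block(x, w1, w2):
    """F(x) + x: both paths must yield the same shape for the sum to work."""
    f = np.maximum(conv3x3(x, w1), 0.0)  # conv + ReLU (BN omitted)
    f = conv3x3(f, w2)                   # conv
    return np.maximum(f + x, 0.0)        # add the skip path, then final ReLU

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
y = residual_block(x, rng.standard_normal((3, 3)), rng.standard_normal((3, 3)))
print(y.shape)  # same as x.shape: the block preserves number and size of maps
```

The "same" padding is what keeps the conv path the same size as the skip path, so the element-wise addition is valid.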
Full ResNet architecture:
- Stack residual blocks
- Every residual block has two 3x3 conv layers
- Periodically, double # of filters and downsample spatially using stride 2 (/2 in each dimension)
- Additional conv layer at the beginning
- No FC layers at the end (only FC 1000 to output classes)
/CSCE-421/Ex2/Visual%20Aids/image-20.png)
Notes:
- Note the 7x7 conv at the start is not important here
- Note that after every two 3x3 conv layers we have a residual connection
- Two layers of 3x3 convolution per block
- Note the notation: 3x3 conv, 512, /2
- The /2 means the spatial size of the feature maps is halved (stride 2)
- After the final global pooling layer, the size of the feature maps becomes 1x1
- After this pooling layer, you produce the vector x (a feature map of size 1x1) -> x will be the input for logistic regression
- FC 1000 is our only fully connected layer
- In each network we need at least one fully connected layer
- Then we do softmax
- You can see that for these layers (3x3 blocks), because we do padding, they do not change the size of the feature maps
- The only tricky part is between the layers of different colors
- Inside each color, the number and size of the feature maps stay the same
- We need to do something between the different colors to make sure the shapes match up for the skip connection
- The maroon blocks produce 64 feature maps
- The purple blocks output 128 feature maps: the size of each feature map is halved while the number of feature maps is doubled
- How do you increase the number of feature maps? Just add more filters
- Exam: some operation needs to be done on the skip path to achieve that /2. How does it happen that the number of feature maps is doubled while the size of the feature maps is halved?
- We use a stride-2 1x1 conv (+BN) on the shortcut and double the number of output feature maps
- Note that there needs to be a batch normalization after it
- Note that after each convolutional layer there is batch normalization, and ReLU after that
- The whole set of operations in one block is:
- Conv + BN + ReLU + conv + BN + addition + ReLU
- What does it mean to do a 1x1 convolution with stride 2?
- Kernel size is 1x1
- Stride is 2
- Basically you are skipping every other pixel
- The major difference is how many of these blocks you will use, there are networks of hundreds of layers:
- Total depths 34, 50, 101, or 152 layers for ImageNet
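What a stride-2 1x1 convolution does is easiest to see with plain array slicing; the array values and the single scalar weight below are arbitrary:

```python
import numpy as np

x = np.arange(36, dtype=float).reshape(6, 6)  # one input feature map
w = 0.5                                       # a 1x1 kernel is one weight per in/out channel pair

# Stride-2 1x1 conv on one channel: scale, and keep every other pixel
y = w * x[::2, ::2]
print(x.shape, y.shape)  # (6, 6) -> (3, 3): spatial size halved
```

Doubling the number of output feature maps then just means using twice as many such filters.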
"Bottleneck"
For deeper networks (ResNet-50+), use “bottleneck” layer to improve efficiency (similar to GoogLeNet)
/CSCE-421/Ex2/Visual%20Aids/image-25.png)
Notes:
- The number of feature maps is reduced to 64 using a 1x1 convolution; then you apply a 3x3 convolution while keeping the same number of feature maps
- The entire path does not change the number of feature maps, because you apply another 1x1 convolution after the 3x3 output to expand back
- By doing this whole block, you have a smaller number of parameters than by just applying a 3x3 convolution at full width
- This is called a bottleneck because we go from a larger -> smaller -> larger number of feature maps
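The parameter savings can be checked with the channel sizes from the slide's example (256 -> 64 -> 64 -> 256); `conv_params` is a hypothetical helper:

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k conv layer mapping c_in -> c_out maps (bias ignored)."""
    return k * k * c_in * c_out

bottleneck = (conv_params(1, 256, 64)     # 1x1 reduce:  16384
              + conv_params(3, 64, 64)    # 3x3 narrow:  36864
              + conv_params(1, 64, 256))  # 1x1 expand:  16384
plain = conv_params(3, 256, 256)          # one plain 3x3 at full width: 589824
print(bottleneck, plain)  # the three-layer bottleneck is much cheaper
```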
Identity Mappings in Residual Learning
/CSCE-421/Ex2/Visual%20Aids/image-21.png)
- Residual learning: the shortcut is an identity mapping
Notes:
- There is one more thing that is important
Identity Mappings in Deep Residual Networks
Improving ResNets...
- Improved ResNet block design from creators of ResNet
- Creates a more direct path for propagating information throughout network (moves activation to residual mapping pathway)
- Gives better performance
/CSCE-421/Ex2/Visual%20Aids/image-22.png)
There is one more thing that is important:
- Sequence of operations (one block):
- Conv + BN + ReLU + conv + BN + addition + ReLU
- If you only follow this skip connection, what ... is this block?
- You still end with a ReLU every time
Notes:
- If you look at the whole network, there are some layers stacked on top of each other
- You have many blocks on top of each other
- When you put in a skip connection, it needs to start somewhere and end somewhere
- The discussion is: where do we start this skip connection, and where do we end it?
- Where exactly do we consider a certain sequence of layers a single block, and bound our skip connection based on that block?
- In this example, if you follow the skip connections, you have a ReLU at the end of each block
- What if we shift the boundaries of the skip connection one layer upward (red line)?
- If you do that, the left image is what you get
- Now there is no ReLU between the blocks; the ReLU is inside
- There are no firm guidelines on why this is better than having ReLU at the end of a block; it is an ongoing discussion, but this version is more used in practice
- This is just our view; in reality the network is a bunch of blocks on top of each other
Things to takeaway:
- What is the skip operation (ResNet)
- Bottleneck
- Identity mapping
Question 1: What if the shortcut mapping is not the identity?
(This is not in exam/not in homework)
/CSCE-421/Ex2/Visual%20Aids/image-23.png)
/CSCE-421/Ex2/Visual%20Aids/image-24.png)
What this paper proposes is to keep that skip connection as the identity.
It turns out someone did something similar but added more operations to every block (making it more complex), and it ended up not working.
- Constant scaling: basically multiplying by a certain number, which makes the error scale too!
- The idea of a gate is quite popular and you can arrive at it in many ways:
- Imagine 1 feature map as input and 1 feature map as output, with just a convolution applied in between.
- If you have some object in feature map A (which you obviously want to recognize), it is more important, so we need some feature map to capture the important features of the object.
- The idea of a gate is that you have a side path that does some operations (most likely a convolution) to produce a feature map, then applies a sigmoid element-wise to every location of that side-path feature map, so you get another map where everything is a number between 0 and 1. On the main path you do a regular convolution, and at the end you multiply the 0-1 feature map element-wise with the main-path feature map.
- This is essentially a mask! It is very nice, but in reality this wouldn't work, just because a mask is not smart enough. Why?
- How can you learn the parameters of the side path so it captures the correct features of the object to recognize?
- There is not enough information for you to know this.
- Mask segmentation is a much more difficult task in between.
- You are not told the locations of the object to recognize in the original feature map; this is a difficult task, and most of the time you will make mistakes about the locations you are interested in.
- This is exactly what was done in that 1990 paper -> if you add a gate it won't work
- But there are tons of different gates.
- This is a debate; some people used this long before what we use now.
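A tiny NumPy sketch of the gating idea described above, with random stand-in maps instead of real convolution outputs (all names and shapes here are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
main = rng.standard_normal((4, 4))  # main path (stand-in for its conv output)
side = rng.standard_normal((4, 4))  # side path (stand-in for its conv output)

gate = sigmoid(side)   # element-wise values strictly between 0 and 1
out = gate * main      # soft mask multiplied element-wise into the main path
print(out.shape)       # same shape as the main-path map
```

The gate acts as a learned soft mask: locations where the sigmoid is near 0 suppress the main-path features, and locations near 1 pass them through.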
Identity Mappings in Residual Learning
/CSCE-421/Ex2/Visual%20Aids/image-26.png)