HW3 - Convolutional Networks

Class: CSCE-421


Notes:

Question 1

A single (15 × 15 × 3) image is passed through a convolutional layer with 28 filters, each of size (3 × 3 × 3). The padding size is 1 (1 unit at top, bottom, left, and right) and the stride size is also 1. What is the size of the output feature map volume? What is the number of parameters in this layer (including bias)? Note that, for simplicity, we consider filters of 3 × 3 × 3 as one filter, instead of three filters of size 3 × 3.


What is a Convolution?

To understand how Convolutional Neural Networks (CNNs) work from scratch, we first need to understand how a computer sees an image and how we can extract patterns from it.

To a computer, an image is simply a two-dimensional array of numbers. In a CNN, we use something called a filter (sometimes called a kernel), which is a much smaller box of numbers. The core idea of a convolution is to take this small filter and slide it across the input image. At every step, the network performs an element-wise multiplication between the numbers in the filter and the numbers in that specific patch of the image, and then sums them all up to produce a single number.

These filters are designed to detect important visual features, like vertical or horizontal edges, which combine to form shapes and objects. The numbers inside these filters are the parameters that the neural network actually learns from the data during training.

Part 1: What is the size of the output feature map volume?

To find the volume of the output, we need to calculate its spatial dimensions (Height and Width) and its depth (Number of output channels/slices).

You are given a single image of size:

15×15×3

This means the input is 15 pixels tall, 15 pixels wide, and 3 channels deep (the RGB color channels).

The layer has 28 filters, each of size:

3×3×3

That means each filter is 3 pixels tall, 3 pixels wide, and spans all 3 input channels at once.

This is important: a filter slides across the spatial dimensions, but its depth is fixed to match the input depth.

Spatial vs. Depth Dimensions

| Dimension | Filter Behavior | Purpose |
| --- | --- | --- |
| Width (W) | Slides (Stride) | Locates features horizontally. |
| Height (H) | Slides (Stride) | Locates features vertically. |
| Depth (D) | Fixed (Full Depth) | Combines channel data into a new feature. |

1. Spatial Dimensions (Height and Width)

When you slide a filter over an image, the output size naturally shrinks, and the pixels on the extreme borders are not treated as fairly as the pixels in the center. To prevent the image from shrinking too quickly, we use padding, which means we artificially add a border of zero-value pixels around the outside of the input image.

Your original image is 15×15. Because the problem states there is a padding of 1, you are adding 1 pixel to the top, 1 to the bottom, 1 to the left, and 1 to the right. This makes your "effective" input size 15+1+1=17 for both height and width.

The problem also mentions a stride of 1. Stride simply dictates how many steps the filter takes when it slides across the image.

To calculate the exact size of the output, your professor provided this formula:

$$\text{Output Size} = \frac{\text{Input Size} - \text{Filter Size}}{\text{Stride}} + 1$$

Let's plug your numbers into this formula for both height and width:

$$\text{Output Dimension} = \frac{17 - 3}{1} + 1 = 15$$

Notice how using a padding of 1 with a 3×3 filter perfectly preserved your original 15×15 image size.

2. Depth (Number of Channels)

Every time you slide a single filter across the entire image, it generates one completely independent 2D output slice. Because your layer uses 28 filters, the network will generate 28 completely independent output slices and stack them together.

In other words: Each filter produces one feature map. Since there are 28 filters, the output depth will be: 28

Final Answer for Part 1: Combining the height, width, and depth, the final output feature map volume is

15×15×28

Part 2: What is the number of parameters in this layer (including bias)?

A "parameter" is a specific weight or number that the network has to learn. To find the total number of parameters in this layer, we need to calculate the parameters for just one filter, and then multiply that by the total number of filters.

1. Parameters in a single filter Color images are not just flat grids; they have depth because they are made of 3 color channels (Red, Green, Blue). Therefore, a filter must also have depth so it can connect to every input channel at the same time. This is why your filter size is given as 3×3×3 (Height × Width × Input Channels).

To find the number of weights in one filter, you multiply those dimensions: 3 × 3 × 3 = 27 weights.

Additionally, every single filter in a neural network gets exactly 1 bias parameter added to it, which acts as a threshold.

2. Total parameters in the layer You have 28 of these filters in total. Because each output slice is generated completely independently, every single filter has its own unique set of parameters. For this convolutional layer we have: 28 × (27 weights + 1 bias) = 28 × 28 = 784 parameters.

Final Answer for Part 2: There are 784 parameters in this convolutional layer.
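Both answers are easy to sanity-check in a few lines of Python (a sketch; the helper name `conv_output_size` is mine, not from the course):

```python
def conv_output_size(n, f, p, s):
    """Spatial output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# 15x15x3 input, 28 filters of 3x3x3, padding 1, stride 1
out = conv_output_size(15, 3, 1, 1)   # (15 + 2 - 3) // 1 + 1 = 15
depth = 28                            # one feature map per filter
params = depth * (3 * 3 * 3 + 1)      # 27 weights + 1 bias per filter
print(out, out, depth)                # 15 15 28
print(params)                         # 784
```

Note that padding enters the formula as `2p` because one pixel is added on each side of the image.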

Question 2

In this question, you can (a) assume padding of appropriate size is used in convolutional layers, and (b) ignore batch normalization. Given the residual block as below:

(Residual block diagram: Visual Aids/image.png)

1. Skip connection

What projection shortcut operations are required on the skip connection?


This question introduces one of the most famous breakthroughs in modern deep learning: the Residual Network (ResNet).

The Basics: Feature Maps and Stride

What is a Residual Block and a Skip Connection?

In traditional neural networks, data flows straight down a single path, layer by layer. However, researchers found that if you make a network too deep (e.g., 56 layers), the performance actually gets worse, because the network becomes too mathematically difficult to optimize.

To fix this, researchers invented the Residual Block. Instead of forcing the network to learn a completely new transformation from scratch at every layer, they added a second path called a skip connection (the long arrow going down the right side of your diagram).

At the very bottom of the block, you hit the "add" operation. The network takes the newly transformed data from the left path and adds it element-by-element to the original data from the right path.

The Problem: The "Add" Crash

Here is the golden rule of residual blocks: To add two volumes of data together, they must have the exact same spatial size (height and width) and the exact same number of feature maps (depth).

Let's track the data through your specific diagram to see why this rule creates a massive problem:

The Input: a 32 × 32 × 128 volume enters the block and is split between the two paths.
The Main Path (left): the convolutions halve the spatial dimensions and increase the depth to 256, so the left path outputs a 16 × 16 × 256 volume.
The Skip Path (right): a plain wire would deliver the untouched 32 × 32 × 128 input straight to the "add" node.

The Crash: At the "add" node, the network tries to add a 16 × 16 × 256 block (from the left) to a 32 × 32 × 128 block (from the right). Because the dimensions do not match at all, the math crashes.

The Solution: The Projection Shortcut

To fix this crash, we cannot just blindly copy the input down the skip connection. We have to apply a quick operation to the skip connection to force its dimensions to perfectly match the main path. This is what we call a "projection shortcut."

We need to fix two things on the skip connection:

  1. Fixing the Depth: We need to increase the number of feature maps from 128 to 256. To do this without messing up the actual image patterns, we use a 1x1 convolution with 256 filters. A 1x1 convolution is specifically used to change the number of feature maps.
  2. Fixing the Spatial Size: We need to shrink the height and width by a factor of 2. We do this by applying a stride of 2 to that same 1x1 convolution.

By putting a 1×1 convolution with a stride of 2 on the skip connection, the right path will output a 16 × 16 × 256 block. Now both paths match perfectly, and the "add" operation will succeed!

Mathematically

In a Residual Network, the standard formula for a block is $y = F(x) + x$. However, when the dimensions of the main path $F(x)$ and the input $x$ do not match, we must apply a linear projection $W_s$ to the shortcut so that $y = F(x) + W_s x$.

Here is a mathematical way to write your answer using LaTeX, which incorporates the tensor dimensions and the output size formula from your notes:

Let the input to the residual block be the tensor $x \in \mathbb{R}^{H \times W \times 128}$. The main convolutional path downsamples the spatial dimensions and increases the filters, yielding an output $F(x) \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 256}$.

To perform the element-wise residual addition $y = F(x) + W_s x$, the projection shortcut $W_s$ must transform $x$ to match the exact dimensions of $F(x)$.
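To make the dimension bookkeeping concrete, here is a minimal numpy sketch of the projection shortcut (function and variable names are mine, not from the assignment). Because the kernel is 1×1, a stride-2 convolution reduces to subsampling the spatial grid and applying a per-pixel linear map across channels:

```python
import numpy as np

def projection_shortcut(x, w, stride=2):
    """1x1 convolution with stride: subsample the grid, then mix channels.
    x: (H, W, C_in) input volume; w: (C_in, C_out) weight matrix."""
    return np.einsum("hwc,cd->hwd", x[::stride, ::stride, :], w)

x = np.random.randn(32, 32, 128)   # the block's input volume
w = np.random.randn(128, 256)      # one weight per (input, output) channel pair
y = projection_shortcut(x, w)
print(y.shape)                     # (16, 16, 256) -- now it matches F(x)
```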

2. Number of trainable parameters

What is the total number of trainable parameters in the block (you can ignore bias terms, but need to consider the skip connection)?


The Core Formula for Trainable Parameters

In #Question 1, we learned that the trainable parameters are the actual numbers (weights) inside the filters that the computer has to learn.

Since the problem explicitly says we can ignore the bias terms, our final mathematical formula is simply: Parameters = (Filter Height × Filter Width × Input Feature Maps) × Output Feature Maps

(Note: You will notice that stride is not in this formula. Stride only changes how the filter moves across the image; it does not change the physical size of the filter itself, so it does not affect the number of parameters!)

Now, let's apply this formula to the three distinct convolutional layers in your residual block.

Step 1: The First Convolution on the Main Path

Looking at the left side of your diagram, the data first passes through a 3×3 convolution to create 256 feature maps.

Let's plug this into our formula:

Parameters = (3 × 3 × 128) × 256 = 294,912

Step 2: The Second Convolution on the Main Path

The data continues down the left side into the second 3×3 convolution.

Let's plug this into our formula:

Parameters = (3 × 3 × 256) × 256 = 589,824

Step 3: The Skip Connection (Projection Shortcut)

Remember from the previous question that the skip connection (the right path) cannot just be an empty wire. Because the dimensions didn't match, we had to add a 1×1 convolution to it to increase the feature maps from 128 to 256. These 1×1 filters also contain learnable parameters!

Let's plug this into our formula:

Parameters = (1 × 1 × 128) × 256 = 32,768

Step 4: The Final Total

To find the total number of trainable parameters in the entire residual block, we simply add the parameters from all three of these convolutions together:

294,912 + 589,824 + 32,768 = 917,504

Final Answer Summary:

The total number of trainable parameters in this residual block is 917,504. We calculate this by summing the parameters of the three convolutional operations (using the formula Height × Width × Input Depth × Output Filters without bias):

  1. First 3×3 layer: (3×3×128)×256=294,912
  2. Second 3×3 layer: (3×3×256)×256=589,824
  3. Skip Connection (1×1 layer): (1×1×128)×256=32,768
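These three products and their sum can be verified with a tiny helper (a sketch; `conv_params` is a name I made up):

```python
def conv_params(kh, kw, c_in, c_out, bias=False):
    """Trainable parameters of a conv layer: one (kh x kw x c_in) kernel
    per output feature map, plus an optional bias per filter."""
    return (kh * kw * c_in + (1 if bias else 0)) * c_out

layers = {
    "main 3x3 #1": conv_params(3, 3, 128, 256),  # 294,912
    "main 3x3 #2": conv_params(3, 3, 256, 256),  # 589,824
    "skip 1x1":    conv_params(1, 1, 128, 256),  # 32,768
}
total = sum(layers.values())
print(total)  # 917504
```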

Question 3

Using batch normalization in neural networks requires computing the mean and variance of a tensor. Suppose a batch normalization layer takes vectors $z_1, z_2, \ldots, z_m$ as input, where $m$ is the mini-batch size. It computes $\hat{z}_1, \hat{z}_2, \ldots, \hat{z}_m$ according to

$$\hat{z}_i = \frac{z_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

where

$$\mu = \frac{1}{m}\sum_{i=1}^{m} z_i, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (z_i - \mu)^2.$$

It then applies a second transformation to obtain $\tilde{z}_1, \tilde{z}_2, \ldots, \tilde{z}_m$ using learned parameters $\gamma$ and $\beta$ as

$$\tilde{z}_i = \gamma \hat{z}_i + \beta.$$

In this question, you can assume that $\epsilon = 0$.

Part 1

  1. (5 points) You forward-propagate a mini-batch of m=4 examples in your network. Suppose you are at a batch normalization layer, where the immediately previous layer is a fully connected layer with 3 units. Therefore, the input to this batch normalization layer can be represented as the below matrix:
$$\begin{bmatrix} 12 & 14 & 14 & 12 \\ 0 & 10 & 10 & 0 \\ -5 & 5 & 5 & -5 \end{bmatrix}$$

What are $\hat{z}_i$? Please express your answer in a 3×4 matrix.

1. The Concepts: What is Batch Normalization?

Batch Normalization is considered one of the most important breakthroughs in Deep Learning because it allows networks to train much faster and more stably.

The Problem: As data passes through the many layers of a neural network, the scale of the numbers can get messy. One node (unit) might output values in the thousands, while another node outputs decimals. If the numbers are on completely different scales, the network struggles to learn, and the learning process can oscillate or diverge.

The Solution: To fix this, we force the outputs of the layers to be on a standard, predictable scale. Specifically, we want the data coming out of each unit to have a mean (average) of 0 and a variance (spread) of 1. Your professor’s notes describe this perfectly: "you want zero-mean unit-variance activations? just make them so".

How it works (Mini-Batches & Dimensions):

The Golden Rule of Batch Norm: Batch Normalization is calculated independently for each dimension (unit). This means you do not calculate the average of the whole matrix. You calculate the average and variance for Row 1 independently, then Row 2 independently, and then Row 3 independently.

2. Why Batch Normalization?

Why do we use Mini-Batches instead of the entire dataset?

To understand this, we need to quickly review how a neural network learns. The network looks at the data, makes a prediction, calculates how wrong it is (the error), and then updates its internal weights to be more accurate next time.

If you have a dataset of 1 million images, you have three choices for how to feed that data to the network: one image at a time (Stochastic Gradient Descent), the entire dataset at once (Batch Gradient Descent), or small groups of images at a time (Mini-Batch Gradient Descent). We use mini-batches for two main reasons:

1. Hardware and Memory Limits (The physical reason) The most straightforward reason we cannot pass the entire dataset at once is that computers simply do not have enough memory to hold it. When a neural network processes data, it has to store all the intermediate math calculations for every single image in the computer's graphics card (GPU) memory.

2. Learning Speed (The mathematical reason) If you use Batch Gradient Descent (passing the entire dataset at once), the network will process all 1 million images, calculate the total error, and then take a single update step. This means your network spent a massive amount of computational power just to learn one single thing.

Why is Batch Normalization calculated independently for each dimension (unit)?

To understand this, we need to think about what the numbers coming out of those units actually represent.

1. The Danger of Mixed Scales
Imagine a network trying to predict heart attacks. The data passing through the network contains entirely different types of features: one unit might process a person's age (e.g., 62), while another unit processes their annual salary (e.g., 40,000).

2. The Goal: A "Similar Pace" for Everything
To fix this, we want to force every single piece of data to speak the exact same mathematical language. The goal of Batch Normalization is to guarantee that the data coming out of every single unit has a mean (average) of exactly 0 and a variance (spread) of exactly 1. This ensures that all features update at a "similar pace".

3. Why it MUST be calculated independently
If we took the entire layer of units and calculated one giant average for all of them combined, the 40,000 salary numbers would drag the average way up. If we then subtracted that giant average from the age unit, a 62-year-old would suddenly be represented by a massive negative number! The scales would still be completely ruined.

Therefore, the only way to ensure every feature is on a level playing field is to compute the empirical mean and variance independently for each dimension (unit).

3. Solving the Math Step-by-Step

Let's apply the formulas provided in your question to each row individually.

Unit 1 (Row 1)

The raw signals for the first unit across the 4 examples are: $z^{(1)} = [12, 14, 14, 12]$

Step A: Calculate the Mean ($\mu$). Add them up and divide by $m = 4$: $\mu_1 = \frac{12 + 14 + 14 + 12}{4} = \frac{52}{4} = 13$

Step B: Calculate the Variance ($\sigma^2$). Subtract the mean from each number, square the result, and average them: $\sigma_1^2 = \frac{(-1)^2 + 1^2 + 1^2 + (-1)^2}{4} = \frac{4}{4} = 1$

Step C: Normalize ($\hat{z}$). Subtract the mean and divide by the square root of the variance (the standard deviation), with $\epsilon = 0$: $\hat{z}^{(1)} = \frac{[12, 14, 14, 12] - 13}{\sqrt{1}} = [-1, 1, 1, -1]$


Unit 2 (Row 2)

The raw signals for the second unit are: $z^{(2)} = [0, 10, 10, 0]$

Step A: Calculate the Mean ($\mu$): $\mu_2 = \frac{0 + 10 + 10 + 0}{4} = 5$

Step B: Calculate the Variance ($\sigma^2$): $\sigma_2^2 = \frac{(-5)^2 + 5^2 + 5^2 + (-5)^2}{4} = 25$

Step C: Normalize ($\hat{z}$): $\hat{z}^{(2)} = \frac{[0, 10, 10, 0] - 5}{\sqrt{25}} = [-1, 1, 1, -1]$


Unit 3 (Row 3)

The raw signals for the third unit are: $z^{(3)} = [-5, 5, 5, -5]$.

Step A: Calculate the Mean ($\mu$): $\mu_3 = \frac{-5 + 5 + 5 - 5}{4} = 0$

Step B: Calculate the Variance ($\sigma^2$): $\sigma_3^2 = \frac{(-5)^2 + 5^2 + 5^2 + (-5)^2}{4} = 25$

Step C: Normalize ($\hat{z}$): $\hat{z}^{(3)} = \frac{[-5, 5, 5, -5] - 0}{\sqrt{25}} = [-1, 1, 1, -1]$


📝 Final Answer

By stacking our normalized rows back together, the final normalized 3×4 matrix Z^ is:

$$\hat{Z} = \begin{bmatrix} -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & -1 \end{bmatrix}$$

(Notice how, despite starting with completely different ranges of numbers in the original matrix, the Batch Normalization successfully squashed every single row into the exact same standardized scale!)

The Vector Approach (The Professor's Formula)

In your professor's formula, the $z_i$ terms are column vectors. In your specific problem, they look like this: $z_1 = [12, 0, -5]^\top$, $z_2 = [14, 10, 5]^\top$, $z_3 = [14, 10, 5]^\top$, $z_4 = [12, 0, -5]^\top$.

Look closely at the professor's formula for the mean: $\mu = \frac{1}{m}\sum_{i=1}^{m} z_i$.
Because you are adding vectors together, you must follow the rules of linear algebra, which dictate that vector addition is performed element-wise (row by row).

Let's plug the vectors into the professor's formula:
$$\mu = \frac{1}{4}\left(\begin{bmatrix} 12 \\ 0 \\ -5 \end{bmatrix} + \begin{bmatrix} 14 \\ 10 \\ 5 \end{bmatrix} + \begin{bmatrix} 14 \\ 10 \\ 5 \end{bmatrix} + \begin{bmatrix} 12 \\ 0 \\ -5 \end{bmatrix}\right)$$

When you add those columns together, you add the top row together, the middle row together, and the bottom row together:
$$\mu = \frac{1}{4}\begin{bmatrix} 12+14+14+12 \\ 0+10+10+0 \\ -5+5+5-5 \end{bmatrix} = \frac{1}{4}\begin{bmatrix} 52 \\ 20 \\ 0 \end{bmatrix} = \begin{bmatrix} 13 \\ 5 \\ 0 \end{bmatrix}$$

Notice what just happened! The resulting mean μ is a vector containing exactly the three numbers we found when we calculated it row-by-row.

Why did I explain it row-by-row?

I broke it down row-by-row because your professor's notes explicitly state the golden rule of Batch Normalization: "compute the empirical mean and variance independently for each dimension".

If you look at later slides, the professor actually expands the notation to show this explicitly by using two indices: $\hat{x}_{i,j} = \frac{x_{i,j} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}$. In this expanded notation, $i$ indexes the example within the mini-batch, and $j$ indexes the dimension (unit).

Summary

The Vectorized Approach: Computation

Instead of calculating row-by-row, we can use the formal vector definitions of Batch Normalization, treating each example in the mini-batch as a full column vector. Operations like addition, subtraction, squaring, and division are performed element-wise.

Step 1: Define the Input Vectors ($z_i$) Separate the input matrix into $m = 4$ column vectors, where each vector represents one example in the mini-batch: $$z_1 = \begin{bmatrix} 12 \\ 0 \\ -5 \end{bmatrix}, \quad z_2 = \begin{bmatrix} 14 \\ 10 \\ 5 \end{bmatrix}, \quad z_3 = \begin{bmatrix} 14 \\ 10 \\ 5 \end{bmatrix}, \quad z_4 = \begin{bmatrix} 12 \\ 0 \\ -5 \end{bmatrix}$$

Step 2: Calculate the Mean Vector ($\mu$) Add all column vectors together and divide by $m$: $$\mu = \frac{1}{4} (z_1 + z_2 + z_3 + z_4) = \frac{1}{4} \begin{bmatrix} 12+14+14+12 \\ 0+10+10+0 \\ -5+5+5-5 \end{bmatrix} = \begin{bmatrix} 13 \\ 5 \\ 0 \end{bmatrix}$$

Step 3: Calculate the Variance Vector ($\sigma^2$) Subtract the mean vector $\mu$ from each input vector, square the resulting elements, and average them: $$\sigma^2 = \frac{1}{4} \left( (z_1 - \mu)^2 + (z_2 - \mu)^2 + (z_3 - \mu)^2 + (z_4 - \mu)^2 \right)$$ $$\sigma^2 = \frac{1}{4} \left( \begin{bmatrix} -1 \\ -5 \\ -5 \end{bmatrix}^2 + \begin{bmatrix} 1 \\ 5 \\ 5 \end{bmatrix}^2 + \begin{bmatrix} 1 \\ 5 \\ 5 \end{bmatrix}^2 + \begin{bmatrix} -1 \\ -5 \\ -5 \end{bmatrix}^2 \right) = \frac{1}{4} \left( \begin{bmatrix} 1 \\ 25 \\ 25 \end{bmatrix} + \begin{bmatrix} 1 \\ 25 \\ 25 \end{bmatrix} + \begin{bmatrix} 1 \\ 25 \\ 25 \end{bmatrix} + \begin{bmatrix} 1 \\ 25 \\ 25 \end{bmatrix} \right) = \begin{bmatrix} 1 \\ 25 \\ 25 \end{bmatrix}$$

Step 4: Normalize Each Vector ($\hat{z}_i$) Apply the normalization formula $\hat{z}_i = \frac{z_i - \mu}{\sqrt{\sigma^2}}$ (assuming $\epsilon = 0$) using element-wise division. The standard deviation vector is $\sqrt{\sigma^2} = [1, 5, 5]^\top$.

Step 5: Reconstruct the Final Matrix ($\hat{Z}$) Stack the resulting column vectors back together to form the final 3×4 normalized matrix: $$\hat{Z} = [\hat{z}_1, \hat{z}_2, \hat{z}_3, \hat{z}_4] = \begin{bmatrix} -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & -1 \end{bmatrix}$$
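Both the row-by-row and the vectorized derivations can be checked numerically with a few lines of numpy (a sketch; the statistics are computed along the mini-batch axis, i.e., one mean and variance per row):

```python
import numpy as np

Z = np.array([[12, 14, 14, 12],
              [ 0, 10, 10,  0],
              [-5,  5,  5, -5]], dtype=float)

# Mean and variance per dimension (row), computed across the 4 examples.
mu = Z.mean(axis=1, keepdims=True)   # [[13], [5], [0]]
var = Z.var(axis=1, keepdims=True)   # [[1], [25], [25]]

Z_hat = (Z - mu) / np.sqrt(var)      # epsilon = 0, per the problem
print(Z_hat)                         # every row becomes [-1, 1, 1, -1]
```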

Part 2

Continue with the above setting. Suppose $\gamma = (1, 1, 1)$ and $\beta = (0, -10, 10)$. What are $\tilde{z}_i$? Please express your answer in a 3×4 matrix.


Now that you have successfully normalized the data (giving it a mean of 0 and a variance of 1), you are ready for the second half of the Batch Normalization process: Scaling and Shifting.

Here is the step-by-step explanation of why we do this and how to solve the math.

1. The Concept: Why Scale and Shift?

In the previous question, we squashed the output of every single unit so that it perfectly centered around 0 with a spread of 1. However, forcing every single layer to have the exact same rigid scale can sometimes be too restrictive and actually hurt the neural network's ability to learn complex patterns.

To fix this, the creators of Batch Normalization added a clever trick: after we normalize the data, we give the network the power to scale and shift the data into whatever range it actually needs.

If the network decides that the strict 0-mean and 1-variance was a bad idea, it can use γ and β to completely reverse the normalization and recover the original raw data.

2. Breaking Down the Parameters

Just like the mean and variance, the scaling (γ) and shifting (β) are applied independently to each dimension (unit/row).

The problem states that $\gamma = (1, 1, 1)$ and $\beta = (0, -10, 10)$. Because there are 3 units (rows) in our network, these vectors contain 3 numbers, and each number is applied to its own row: Unit 1 uses $\gamma_1 = 1$ and $\beta_1 = 0$, Unit 2 uses $\gamma_2 = 1$ and $\beta_2 = -10$, and Unit 3 uses $\gamma_3 = 1$ and $\beta_3 = 10$.

3. Solving the Math Step-by-Step

Let's take the normalized matrix $\hat{Z}$ we calculated in the previous question: $$\hat{Z} = \begin{bmatrix} -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & -1 \end{bmatrix}$$

We will now apply the formula $\tilde{z}_i = \gamma \hat{z}_i + \beta$ to each row element-by-element.

Unit 1 (Row 1): $\tilde{z}^{(1)} = 1 \cdot [-1, 1, 1, -1] + 0 = [-1, 1, 1, -1]$

Unit 2 (Row 2): $\tilde{z}^{(2)} = 1 \cdot [-1, 1, 1, -1] + (-10) = [-11, -9, -9, -11]$

Unit 3 (Row 3): $\tilde{z}^{(3)} = 1 \cdot [-1, 1, 1, -1] + 10 = [9, 11, 11, 9]$


📝 Final Answer

By stacking our newly scaled and shifted rows back together, the final output matrix Z~ for the Batch Normalization layer is:

$$\tilde{Z} = \begin{bmatrix} -1 & 1 & 1 & -1 \\ -11 & -9 & -9 & -11 \\ 9 & 11 & 11 & 9 \end{bmatrix}$$
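The scale-and-shift step can be checked with numpy as well (a sketch; $\gamma$ and $\beta$ carry one value per unit, applied row-wise via broadcasting):

```python
import numpy as np

Z_hat = np.array([[-1, 1, 1, -1],
                  [-1, 1, 1, -1],
                  [-1, 1, 1, -1]], dtype=float)

gamma = np.array([1.0, 1.0, 1.0])    # per-unit scale
beta = np.array([0.0, -10.0, 10.0])  # per-unit shift

# Reshape to columns so each parameter broadcasts across its own row.
Z_tilde = gamma[:, None] * Z_hat + beta[:, None]
print(Z_tilde)
```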