HW3 - Convolutional Networks

Class: CSCE-421


Notes:

Question 1

A single (15 × 15 × 3) image is passed through a convolutional layer with 28 filters, each of size (3 × 3 × 3). The padding size is 1 (1 unit at top, bottom, left, and right) and the stride size is also 1. What is the size of the output feature map volume? What is the number of parameters in this layer (including bias)? Note that, for simplicity, we consider filters of 3 × 3 × 3 as one filter, instead of three filters of size 3 × 3.


What is a Convolution?

To understand how Convolutional Neural Networks (CNNs) work from scratch, we first need to understand how a computer sees an image and how we can extract patterns from it.

To a computer, an image is simply a two-dimensional array of numbers. In a CNN, we use something called a filter (sometimes called a kernel), which is a much smaller box of numbers. The core idea of a convolution is to take this small filter and slide it across the input image. At every step, the network performs an element-wise multiplication between the numbers in the filter and the numbers in that specific patch of the image, and then sums them all up to produce a single number.

These filters are designed to detect important visual features, like vertical or horizontal edges, which combine to form shapes and objects. The numbers inside these filters are the parameters that the neural network actually learns from the data during training.

Part 1: What is the size of the output feature map volume?

To find the volume of the output, we need to calculate its spatial dimensions (Height and Width) and its depth (Number of output channels/slices).

You are given a single image of size:

15×15×3

This means the spatial dimensions are 15 × 15, and the depth is 3 (the RGB color channels).

The layer has 28 filters, each of size:

3×3×3

That means each filter covers a 3 × 3 spatial window and extends through all 3 input channels at once.

This is important:

Spatial vs. Depth Dimensions

| Dimension | Filter Behavior | Purpose |
|---|---|---|
| Width (W) | Slides (Stride) | Locates features horizontally. |
| Height (H) | Slides (Stride) | Locates features vertically. |
| Depth (D) | Fixed (Full Depth) | Combines channel data into a new feature. |

1. Spatial Dimensions (Height and Width)

When you slide a filter over an image, the output size naturally shrinks, and the pixels on the extreme borders are not treated as fairly as the pixels in the center. To prevent the image from shrinking too quickly, we use padding, which means we artificially add a border of zero-value pixels around the outside of the input image.

Your original image is 15×15. Because the problem states there is a padding of 1, you are adding 1 pixel to the top, 1 to the bottom, 1 to the left, and 1 to the right. This makes your "effective" input size 15+1+1=17 for both height and width.

The problem also mentions a stride of 1. Stride simply dictates how many steps the filter takes when it slides across the image.

To calculate the exact size of the output, your professor provided this formula:

$$\text{Output Size} = \frac{\text{Input Size} - \text{Filter Size}}{\text{Stride}} + 1$$

Let's plug your numbers into this formula for both height and width:

$$\text{Output Dimension} = \frac{17 - 3}{1} + 1 = 15$$

Notice how using a padding of 1 with a 3×3 filter perfectly preserved your original 15×15 image size.
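The formula above is easy to sanity-check in code. This is a minimal sketch; `conv_output_size` is a hypothetical helper name, not from the assignment:

```python
def conv_output_size(input_size, filter_size, padding, stride):
    """Spatial output size of a convolution (floor division for non-even fits)."""
    return (input_size + 2 * padding - filter_size) // stride + 1

# The layer from Question 1: 15x15 input, 3x3 filter, padding 1, stride 1
print(conv_output_size(15, 3, 1, 1))  # → 15
```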

2. Depth (Number of Channels)

Every time you slide a single filter across the entire image, it generates one completely independent 2D output slice. Because your layer uses 28 filters, the network will generate 28 completely independent output slices and stack them together.

In other words: Each filter produces one feature map. Since there are 28 filters, the output depth will be: 28

Final Answer for Part 1: Combining the height, width, and depth, the final output feature map volume is

15×15×28

Part 2: What is the number of parameters in this layer (including bias)?

A "parameter" is a specific weight or number that the network has to learn. To find the total number of parameters in this layer, we need to calculate the parameters for just one filter, and then multiply that by the total number of filters.

1. Parameters in a single filter Color images are not just flat grids; they have depth because they are made of 3 color channels (Red, Green, Blue). Therefore, a filter must also have depth so it can connect to every input channel at the same time. This is why your filter size is given as 3×3×3 (Height × Width × Input Channels).

To find the number of weights in one filter, you multiply those dimensions: 3 × 3 × 3 = 27 weights.

Additionally, every single filter in a neural network gets exactly 1 bias parameter added to it, which acts as a threshold. That brings each filter to 27 + 1 = 28 parameters.

2. Total parameters in the layer You have 28 of these filters in total. Because each output slice is generated completely independently, every single filter has its own unique set of parameters. For this convolutional layer we have: 28 parameters per filter × 28 filters = 784.

Final Answer for Part 2: There are 784 parameters in this convolutional layer.

Question 2

In this question, you can (a) assume padding of appropriate size is used in convolutional layers, and (b) ignore batch normalization. Given the residual block as below:

(Visual aid: residual block diagram — image.png)

1. Skip connection

What projection shortcut operations are required on the skip connection?


This question introduces one of the most famous breakthroughs in modern deep learning: the Residual Network (ResNet).

The Basics: Feature Maps and Stride

What is a Residual Block and a Skip Connection?

In traditional neural networks, data flows straight down a single path, layer by layer. However, researchers found that if you make a network too deep (e.g., 56 layers), the performance actually gets worse because it becomes too mathematically difficult to optimize.

To fix this, researchers invented the Residual Block. Instead of forcing the network to learn a completely new transformation from scratch at every layer, they added a second path called a skip connection (the long arrow going down the right side of your diagram).

At the very bottom of the block, you hit the "add" operation. The network takes the newly transformed data from the left path and adds it element-by-element to the original data from the right path.

The Problem: The "Add" Crash

Here is the golden rule of residual blocks: To add two volumes of data together, they must have the exact same spatial size (height and width) and the exact same number of feature maps (depth).

Let's track the data through your specific diagram to see why this rule creates a massive problem:

  - The input to the block (which the skip connection copies) is a 32 × 32 × 128 volume.
  - The main path's convolutions downsample the spatial size by a stride of 2 and expand the depth to 256 filters, producing a 16 × 16 × 256 volume.

The Crash: At the "add" node, the network tries to add a 16 × 16 × 256 block (from the left) to a 32 × 32 × 128 block (from the right). Because the dimensions do not match at all, the math crashes.

The Solution: The Projection Shortcut

To fix this crash, we cannot just blindly copy the input down the skip connection. We have to apply a quick operation to the skip connection to force its dimensions to perfectly match the main path. This is what we call a "projection shortcut."

We need to fix two things on the skip connection:

  1. Fixing the Depth: We need to increase the number of feature maps from 128 to 256. To do this without messing up the actual image patterns, we use a 1x1 convolution with 256 filters. A 1x1 convolution is specifically used to change the number of feature maps.
  2. Fixing the Spatial Size: We need to shrink the height and width by a factor of 2. We do this by applying a stride of 2 to that same 1x1 convolution.

By putting a 1x1 convolution with a stride of 2 on the skip connection, the right path will output a 16 × 16 × 256 block. Now, both paths match perfectly, and the "add" operation will succeed!

Mathematically

In a Residual Network, the standard formula for a block is $y = F(x) + x$. However, when the dimensions of the main path $F(x)$ and the input $x$ do not match, we must apply a linear projection $W_s$ to the shortcut so that $y = F(x) + W_s x$.

Here is a mathematical way to write your answer using LaTeX, which incorporates the tensor dimensions and the output size formula from your notes:

Let the input to the residual block be the tensor $x \in \mathbb{R}^{H \times W \times 128}$. The main convolutional path downsamples the spatial dimensions and increases the filters, yielding an output $F(x) \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 256}$.

To perform the element-wise residual addition $y = F(x) + W_s x$, the projection shortcut $W_s$ must transform $x$ to match the exact dimensions of $F(x)$: a 1 × 1 convolution with 256 filters and stride 2.
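A quick shape-level check of the shortcut, in plain Python. This is a sketch assuming the first 3×3 conv on the main path carries the stride-2 downsampling; `conv_shape` is a hypothetical helper (in PyTorch the shortcut itself would be `nn.Conv2d(128, 256, kernel_size=1, stride=2)`):

```python
def conv_shape(h, w, c_out, k, stride, pad):
    """Output (H, W, C) of a conv layer; depth becomes the filter count."""
    out = lambda s: (s + 2 * pad - k) // stride + 1
    return out(h), out(w), c_out

# Input to the residual block: 32 x 32 x 128
main = conv_shape(32, 32, 256, k=3, stride=2, pad=1)            # first 3x3 conv (downsamples)
main = conv_shape(main[0], main[1], 256, k=3, stride=1, pad=1)  # second 3x3 conv
skip = conv_shape(32, 32, 256, k=1, stride=2, pad=0)            # 1x1 projection shortcut
print(main, skip)  # → (16, 16, 256) (16, 16, 256): the "add" is now legal
```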

2. Number of trainable parameters

What is the total number of trainable parameters in the block (you can ignore bias terms, but need to consider the skip connection)?


The Core Formula for Trainable Parameters

In Question 1, we learned that the trainable parameters are the actual numbers (weights) inside the filters that the computer has to learn.

Since the problem explicitly says we can ignore the bias terms, our final mathematical formula is simply: Parameters = (Filter Height × Filter Width × Input Feature Maps) × Output Feature Maps

(Note: You will notice that stride is not in this formula. Stride only changes how the filter moves across the image; it does not change the physical size of the filter itself, so it does not affect the number of parameters!)

Now, let's apply this formula to the three distinct convolutional layers in your residual block.

Step 1: The First Convolution on the Main Path

Looking at the left side of your diagram, the data first passes through a 3×3 convolution to create 256 feature maps.

Let's plug this into our formula: (3 × 3 × 128) × 256 = 294,912 parameters.


Step 2: The Second Convolution on the Main Path

The data continues down the left side into the second 3×3 convolution.

Let's plug this into our formula: (3 × 3 × 256) × 256 = 589,824 parameters.


Step 3: The Skip Connection (Projection Shortcut)

Remember from the previous question that the skip connection (the right path) cannot just be an empty wire. Because the dimensions didn't match, we had to add a 1×1 convolution to it to increase the feature maps from 128 to 256. These 1×1 filters also contain learnable parameters!

Let's plug this into our formula: (1 × 1 × 128) × 256 = 32,768 parameters.


Step 4: The Final Total

To find the total number of trainable parameters in the entire residual block, we simply add the parameters from all three of these convolutions together: 294,912 + 589,824 + 32,768 = 917,504.

Final Answer Summary:

The total number of trainable parameters in this residual block is 917,504. We calculate this by summing the parameters of the three convolutional operations (using the formula Height × Width × Input Depth × Output Filters without bias):

  1. First 3×3 layer: (3×3×128)×256=294,912
  2. Second 3×3 layer: (3×3×256)×256=589,824
  3. Skip Connection (1×1 layer): (1×1×128)×256=32,768
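The three products above can be verified directly (the helper name `conv_params` is my own):

```python
def conv_params(k_h, k_w, c_in, c_out):
    """Weights in a conv layer, ignoring bias (as the problem allows)."""
    return k_h * k_w * c_in * c_out

main_1 = conv_params(3, 3, 128, 256)    # first 3x3 layer → 294,912
main_2 = conv_params(3, 3, 256, 256)    # second 3x3 layer → 589,824
shortcut = conv_params(1, 1, 128, 256)  # 1x1 projection → 32,768
print(main_1 + main_2 + shortcut)       # → 917504
```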

Question 3

Using batch normalization in neural networks requires computing the mean and variance of a tensor. Suppose a batch normalization layer takes vectors $z_1, z_2, \ldots, z_m$ as input, where $m$ is the mini-batch size. It computes $\hat{z}_1, \hat{z}_2, \ldots, \hat{z}_m$ according to

$$\hat{z}_i = \frac{z_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

where

$$\mu = \frac{1}{m}\sum_{i=1}^{m} z_i, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (z_i - \mu)^2.$$

It then applies a second transformation to obtain $\tilde{z}_1, \tilde{z}_2, \ldots, \tilde{z}_m$ using learned parameters $\gamma$ and $\beta$ as

$$\tilde{z}_i = \gamma \hat{z}_i + \beta.$$

In this question, you can assume that $\epsilon = 0$.

Part 1

  1. (5 points) You forward-propagate a mini-batch of m=4 examples in your network. Suppose you are at a batch normalization layer, where the immediately previous layer is a fully connected layer with 3 units. Therefore, the input to this batch normalization layer can be represented as the below matrix:
$$\begin{bmatrix} 12 & 14 & 14 & 12 \\ 0 & 10 & 10 & 0 \\ -5 & 5 & 5 & -5 \end{bmatrix}$$

What are $\hat{z}_i$? Please express your answer in a 3×4 matrix.


1. The Concepts: What is Batch Normalization?

Batch Normalization is considered one of the most important breakthroughs in Deep Learning because it allows networks to train much faster and more stably.

The Problem: As data passes through the many layers of a neural network, the scale of the numbers can get messy. One node (unit) might output values in the thousands, while another node outputs decimals. If the numbers are on completely different scales, the network struggles to learn, and the learning process can oscillate or diverge.

The Solution: To fix this, we force the outputs of the layers to be on a standard, predictable scale. Specifically, we want the data coming out of each unit to have a mean (average) of 0 and a variance (spread) of 1. Your professor’s notes describe this perfectly: "you want zero-mean unit-variance activations? just make them so".

How it works (Mini-Batches & Dimensions): The statistics are computed over the current mini-batch of examples, but separately for each dimension (unit) of the layer's output.

The Golden Rule of Batch Norm: Batch Normalization is calculated independently for each dimension (unit). This means you do not calculate the average of the whole matrix. You calculate the average and variance for Row 1 independently, then Row 2 independently, and then Row 3 independently.

2. Why Batch Normalization?

Why do we use Mini-Batches instead of the entire dataset?

To understand this, we need to quickly review how a neural network learns. The network looks at the data, makes a prediction, calculates how wrong it is (the error), and then updates its internal weights to be more accurate next time.

If you have a dataset of 1 million images, you have three choices for how to feed that data to the network: pass the entire dataset at once (Batch Gradient Descent), pass one image at a time (Stochastic Gradient Descent), or pass small chunks at a time (Mini-Batch Gradient Descent). Mini-batches are the standard choice, for two main reasons:

1. Hardware and Memory Limits (The physical reason) The most straightforward reason we cannot pass the entire dataset at once is that computers simply do not have enough memory to hold it. When a neural network processes data, it has to store all the intermediate math calculations for every single image in the computer's graphics card (GPU) memory.

2. Learning Speed (The mathematical reason) If you use Batch Gradient Descent (passing the entire dataset at once), the network will process all 1 million images, calculate the total error, and then take a single update step. This means your network spent a massive amount of computational power just to learn one single thing.

Why is Batch Normalization calculated independently for each dimension (unit)?

To understand this, we need to think about what the numbers coming out of those units actually represent.

1. The Danger of Mixed Scales
Imagine a network trying to predict heart attacks. The data passing through the network contains entirely different types of features: one unit might process a person's age (e.g., 62), while another unit processes their annual salary (e.g., 40,000).

2. The Goal: A "Similar Pace" for Everything
To fix this, we want to force every single piece of data to speak the exact same mathematical language. The goal of Batch Normalization is to guarantee that the data coming out of every single unit has a mean (average) of exactly 0 and a variance (spread) of exactly 1. This ensures that all features update at a "similar pace".

3. Why it MUST be calculated independently
If we took the entire layer of units and calculated one giant average for all of them combined, the 40,000 salary numbers would drag the average way up. If we then subtracted that giant average from the age unit, a 62-year-old would suddenly be represented by a massive negative number! The scales would still be completely ruined.

Therefore, the only way to ensure every feature is on a level playing field is to compute the empirical mean and variance independently for each dimension (unit).

3. Solving the Math Step-by-Step

Let's apply the formulas provided in your question to each row individually.

Unit 1 (Row 1)

The raw signals for the first unit across the 4 examples are:

$$z^{(1)} = [12, 14, 14, 12]$$

Step A: Calculate the Mean ($\mu$) Add them up and divide by $m=4$: $\mu_1 = \frac{12+14+14+12}{4} = 13$.

Step B: Calculate the Variance ($\sigma^2$) Subtract the mean from each number, square the result, and average them: $\sigma_1^2 = \frac{(-1)^2 + 1^2 + 1^2 + (-1)^2}{4} = 1$.

Step C: Normalize ($\hat{z}$) Subtract the mean and divide by the square root of the variance (which is the standard deviation). Note: The problem says to assume $\epsilon=0$. This gives $\hat{z}^{(1)} = \frac{[12, 14, 14, 12] - 13}{1} = [-1, 1, 1, -1]$.


Unit 2 (Row 2)

The raw signals for the second unit are:

$$z^{(2)} = [0, 10, 10, 0]$$

Step A: Calculate the Mean ($\mu$): $\mu_2 = \frac{0+10+10+0}{4} = 5$.

Step B: Calculate the Variance ($\sigma^2$): $\sigma_2^2 = \frac{(-5)^2 + 5^2 + 5^2 + (-5)^2}{4} = 25$.

Step C: Normalize ($\hat{z}$): $\hat{z}^{(2)} = \frac{[0, 10, 10, 0] - 5}{5} = [-1, 1, 1, -1]$.


Unit 3 (Row 3)

The raw signals for the third unit are: $z^{(3)} = [-5, 5, 5, -5]$.

Step A: Calculate the Mean ($\mu$): $\mu_3 = \frac{-5+5+5-5}{4} = 0$.

Step B: Calculate the Variance ($\sigma^2$): $\sigma_3^2 = \frac{(-5)^2 + 5^2 + 5^2 + (-5)^2}{4} = 25$.

Step C: Normalize ($\hat{z}$): $\hat{z}^{(3)} = \frac{[-5, 5, 5, -5] - 0}{5} = [-1, 1, 1, -1]$.


Final Answer

By stacking our normalized rows back together, the final normalized 3×4 matrix $\hat{Z}$ is:

$$\hat{Z} = \begin{bmatrix} -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & -1 \end{bmatrix}$$

(Notice how, despite starting with completely different ranges of numbers in the original matrix, the Batch Normalization successfully squashed every single row into the exact same standardized scale!)
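The whole row-by-row calculation collapses to a few lines of numpy. This is a sketch of the normalization step only (with $\epsilon = 0$, as the problem allows):

```python
import numpy as np

Z = np.array([[12, 14, 14, 12],
              [ 0, 10, 10,  0],
              [-5,  5,  5, -5]], dtype=float)

# Mean and variance per unit (row), i.e. across the batch axis (columns)
mu = Z.mean(axis=1, keepdims=True)   # [[13], [5], [0]]
var = Z.var(axis=1, keepdims=True)   # [[1], [25], [25]]
Z_hat = (Z - mu) / np.sqrt(var)      # epsilon = 0
print(Z_hat)                         # every row becomes [-1, 1, 1, -1]
```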

The Vector Approach (The Professor's Formula)

In your professor's formula, the $z_i$ terms are column vectors. In your specific problem, each column of the matrix is one example: $z_1 = [12, 0, -5]^\top$, $z_2 = [14, 10, 5]^\top$, $z_3 = [14, 10, 5]^\top$, $z_4 = [12, 0, -5]^\top$.

Look closely at the professor's formula for the mean: $\mu = \frac{1}{m}\sum_{i=1}^{m} z_i$.
Because you are adding vectors together, you must follow the rules of linear algebra, which dictate that vector addition is performed element-wise (row by row).

Let's plug the vectors into the professor's formula:
$$\mu = \frac{1}{4}\left(\begin{bmatrix} 12 \\ 0 \\ -5 \end{bmatrix} + \begin{bmatrix} 14 \\ 10 \\ 5 \end{bmatrix} + \begin{bmatrix} 14 \\ 10 \\ 5 \end{bmatrix} + \begin{bmatrix} 12 \\ 0 \\ -5 \end{bmatrix}\right)$$

When you add those columns together, you add the top row together, the middle row together, and the bottom row together:
$$\mu = \frac{1}{4}\begin{bmatrix} 12+14+14+12 \\ 0+10+10+0 \\ -5+5+5-5 \end{bmatrix} = \frac{1}{4}\begin{bmatrix} 52 \\ 20 \\ 0 \end{bmatrix} = \begin{bmatrix} 13 \\ 5 \\ 0 \end{bmatrix}$$

Notice what just happened! The resulting mean $\mu$ is a vector containing exactly the three numbers we found when we calculated it row-by-row.

Why did I explain it row-by-row?

I broke it down row-by-row because your professor's notes explicitly state the golden rule of Batch Normalization: "compute the empirical mean and variance independently for each dimension".

If you look at later slides, the professor actually expands the notation to show this explicitly by using two indices: $\hat{x}_{i,j} = \frac{x_{i,j} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}$. In this expanded notation, $i$ indexes the example in the mini-batch (the column) and $j$ indexes the dimension/unit (the row), so each dimension $j$ gets its own $\mu_j$ and $\sigma_j^2$.

Summary

The Vectorized Approach: Computation

Instead of calculating row-by-row, we can use the formal vector definitions of Batch Normalization, treating each example in the mini-batch as a full column vector. Operations like addition, subtraction, squaring, and division are performed element-wise.

Step 1: Define the Input Vectors ($z_i$) Separate the input matrix into $m=4$ column vectors, where each vector represents one example in the mini-batch: $$z_1 = \begin{bmatrix} 12 \\ 0 \\ -5 \end{bmatrix}, \quad z_2 = \begin{bmatrix} 14 \\ 10 \\ 5 \end{bmatrix}, \quad z_3 = \begin{bmatrix} 14 \\ 10 \\ 5 \end{bmatrix}, \quad z_4 = \begin{bmatrix} 12 \\ 0 \\ -5 \end{bmatrix}$$

Step 2: Calculate the Mean Vector ($\mu$) Add all column vectors together and divide by $m$: $$\mu = \frac{1}{4} (z_1 + z_2 + z_3 + z_4) = \frac{1}{4} \begin{bmatrix} 12+14+14+12 \\ 0+10+10+0 \\ -5+5+5-5 \end{bmatrix} = \begin{bmatrix} 13 \\ 5 \\ 0 \end{bmatrix}$$

Step 3: Calculate the Variance Vector ($\sigma^2$) Subtract the mean vector $\mu$ from each input vector, square the resulting elements, and average them: $$\sigma^2 = \frac{1}{4} \left( (z_1 - \mu)^2 + (z_2 - \mu)^2 + (z_3 - \mu)^2 + (z_4 - \mu)^2 \right)$$ $$\sigma^2 = \frac{1}{4} \left( \begin{bmatrix} -1 \\ -5 \\ -5 \end{bmatrix}^2 + \begin{bmatrix} 1 \\ 5 \\ 5 \end{bmatrix}^2 + \begin{bmatrix} 1 \\ 5 \\ 5 \end{bmatrix}^2 + \begin{bmatrix} -1 \\ -5 \\ -5 \end{bmatrix}^2 \right) = \frac{1}{4} \left( 4 \begin{bmatrix} 1 \\ 25 \\ 25 \end{bmatrix} \right) = \begin{bmatrix} 1 \\ 25 \\ 25 \end{bmatrix}$$

Step 4: Normalize Each Vector ($\hat{z}_i$) Apply the normalization formula $\hat{z}_i = \frac{z_i - \mu}{\sqrt{\sigma^2}}$ (assuming $\epsilon=0$) using element-wise division. The standard deviation vector is $\sqrt{\sigma^2} = [1, 5, 5]^\top$.

Step 5: Reconstruct the Final Matrix ($\hat{Z}$) Stack the resulting column vectors back together to form the final 3×4 normalized matrix: $$\hat{Z} = [\hat{z}_1, \hat{z}_2, \hat{z}_3, \hat{z}_4] = \begin{bmatrix} -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & -1 \end{bmatrix}$$

Part 2

Continue with the above setting. Suppose $\gamma = (1, 1, 1)$ and $\beta = (0, -10, 10)$. What are $\tilde{z}_i$? Please express your answer in a 3×4 matrix.


Now that you have successfully normalized the data (giving it a mean of 0 and a variance of 1), you are ready for the second half of the Batch Normalization process: Scaling and Shifting.

Here is the step-by-step explanation of why we do this and how to solve the math.

1. The Concept: Why Scale and Shift?

In the previous question, we squashed the output of every single unit so that it perfectly centered around 0 with a spread of 1. However, forcing every single layer to have the exact same rigid scale can sometimes be too restrictive and actually hurt the neural network's ability to learn complex patterns.

To fix this, the creators of Batch Normalization added a clever trick: after we normalize the data, we give the network the power to scale and shift the data into whatever range it actually needs.

If the network decides that the strict 0-mean and 1-variance was a bad idea, it can use γ and β to completely reverse the normalization and recover the original raw data.

2. Breaking Down the Parameters

Just like the mean and variance, the scaling (γ) and shifting (β) are applied independently to each dimension (unit/row).

The problem states that $\gamma = (1, 1, 1)$ and $\beta = (0, -10, 10)$. Because there are 3 units (rows) in our network, these vectors contain 3 numbers. Here is how they match up to our rows: $\gamma_1 = 1, \beta_1 = 0$ apply to Row 1; $\gamma_2 = 1, \beta_2 = -10$ apply to Row 2; and $\gamma_3 = 1, \beta_3 = 10$ apply to Row 3.

3. Solving the Math Step-by-Step

Let's take the normalized matrix $\hat{Z}$ we calculated in the previous question:

$$\hat{Z} = \begin{bmatrix} -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & -1 \end{bmatrix}$$

We will now apply the formula $\tilde{z}_i = \gamma \hat{z}_i + \beta$ to each row element-by-element.

**Unit 1 (Row 1):**
- Original normalized row: $[-1, 1, 1, -1]$
- Multiply by $\gamma_1 = 1$: $[-1, 1, 1, -1] \times 1 = [-1, 1, 1, -1]$
- Add $\beta_1 = 0$: $[-1, 1, 1, -1] + 0 = \mathbf{[-1, 1, 1, -1]}$

**Unit 2 (Row 2):**
- Original normalized row: $[-1, 1, 1, -1]$
- Multiply by $\gamma_2 = 1$: $[-1, 1, 1, -1] \times 1 = [-1, 1, 1, -1]$
- Add $\beta_2 = -10$: $[-1, 1, 1, -1] - 10 = \mathbf{[-11, -9, -9, -11]}$

**Unit 3 (Row 3):**
- Original normalized row: $[-1, 1, 1, -1]$
- Multiply by $\gamma_3 = 1$: $[-1, 1, 1, -1] \times 1 = [-1, 1, 1, -1]$
- Add $\beta_3 = 10$: $[-1, 1, 1, -1] + 10 = \mathbf{[9, 11, 11, 9]}$

---

### Final Answer

By stacking our newly scaled and shifted rows back together, the final output matrix $\tilde{Z}$ for the Batch Normalization layer is:

$$\tilde{Z} = \begin{bmatrix} -1 & 1 & 1 & -1 \\ -11 & -9 & -9 & -11 \\ 9 & 11 & 11 & 9 \end{bmatrix}$$
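The scale-and-shift step is a one-line broadcast in numpy. A minimal sketch, continuing from the normalized matrix of Part 1:

```python
import numpy as np

Z_hat = np.array([[-1, 1, 1, -1],
                  [-1, 1, 1, -1],
                  [-1, 1, 1, -1]], dtype=float)
gamma = np.array([[1], [1], [1]])    # one scale per unit (row)
beta = np.array([[0], [-10], [10]])  # one shift per unit (row)

Z_tilde = gamma * Z_hat + beta       # broadcasts row-wise across the batch
print(Z_tilde)  # rows: [-1 1 1 -1], [-11 -9 -9 -11], [9 11 11 9]
```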

## Part 3

Describe the differences of the computations required for batch normalization during training and testing.

---

To solve this, we need to understand a fundamental rule of Machine Learning: **Your model's prediction for a single test image should never change depending on what other images happen to be in the same batch.**

Here is the step-by-step breakdown from scratch of exactly how and why the computations change between training and testing.

### 1. Computations During Training (The "Mini-Batch" Phase)

During the training phase, you are passing data through the network in chunks called **mini-batches** (e.g., 32 images at a time). Here is exactly what the computer calculates for the Batch Normalization layer during training:

1. **Calculate the Batch Mean ($\mu_\mathcal{B}$):** It calculates the average of the data _specifically for the current mini-batch_.
2. **Calculate the Batch Variance ($\sigma_\mathcal{B}^2$):** It calculates the spread of the data _specifically for the current mini-batch_.
3. **Normalize:** It uses that specific mini-batch mean and variance to normalize the data (making it mean 0 and variance 1).
4. **Scale and Shift:** It applies the learnable parameters $\gamma$ and $\beta$ to scale and shift the data.
5. **Update Parameters:** The network uses backpropagation to learn and update the best values for $\gamma$ and $\beta$.

**The Secret Extra Step:** While it is doing all of this, the network is also quietly maintaining a **running average** (a global average) of every mean and variance it sees across the different mini-batches. It saves this global average in its memory for later.

### 2. The Problem: Why can't we do this during Testing?

Imagine you have finished training your network and you deploy it to a hospital to diagnose X-Rays. A doctor uploads a single X-Ray image.

- If you try to run Batch Normalization the same way you did in training, the network will try to calculate the mean and variance of the "batch".
- But the batch size is just 1! The math will completely break.

Furthermore, even if the doctor uploaded 10 X-Rays at once, you **cannot** use the mean of those 10 images. If you did, the diagnosis for Patient A would physically change depending on whether Patient B's X-Ray was in the same batch. Your professor notes that this is a very common mistake.

### 3. Computations During Testing (The "Fixed" Phase)

Because of the problem above, the Batch Normalization layer fundamentally changes how it functions during test time. Here is what it computes during testing:

1. **NO Mean/Variance Calculation:** The network **does not** compute the mean or variance from the incoming test data.
2. **Retrieve the Running Averages:** Instead, it digs into its memory and pulls out the **global running average** for the mean ($\mu$) and variance ($\sigma^2$) that it saved during the training phase.
3. **Normalize:** It normalizes the new test data using those _fixed_ training averages.
4. **Scale and Shift:** It scales and shifts the data using the _fixed_ $\gamma$ and $\beta$ weights that it fully learned during training.

---

### Final Answer

Here is a clear, detailed way you can summarize this for your homework:

**Differences in Computation for Batch Normalization:**

- **During Training:** The layer computes the mean ($\mu_\mathcal{B}$) and variance ($\sigma_\mathcal{B}^2$) directly from the current **mini-batch** of data. It uses these mini-batch statistics to normalize the inputs. During this time, it continuously learns the scaling ($\gamma$) and shifting ($\beta$) parameters via backpropagation. Simultaneously, it maintains and saves a **running average** (global average) of the means and variances seen across all training batches.
- **During Testing (Prediction):** The layer **does not** compute the mean and variance from the test data. Doing so would cause a single sample's prediction to improperly depend on other samples in the test batch. Instead, it uses the **fixed running averages** of the mean and variance that were saved during training. It then applies the fixed, fully learned $\gamma$ and $\beta$ parameters to scale and shift the normalized test data.

### Understanding Unit vs. Mini-batch

This is a very normal point of confusion! The phrases "from the batch" and "for each unit separately" are not mutually exclusive. In fact, they describe two different dimensions of the exact same matrix. Let's bring back your $3 \times 4$ matrix example to visualize exactly how both concepts happen at the same time.

$$\begin{bmatrix} 12 & 14 & 14 & 12 \\ 0 & 10 & 10 & 0 \\ -5 & 5 & 5 & -5 \end{bmatrix}$$

### 1. "For each unit separately" (The Rows)

You are completely correct that we calculate the mean and variance for each unit separately.

- **Row 1** represents **Unit 1**.
- **Row 2** represents **Unit 2**.
- **Row 3** represents **Unit 3**.

We do _not_ calculate a giant average using all 12 numbers in the matrix. We calculate a separate $\mu$ for Row 1, a separate $\mu$ for Row 2, and a separate $\mu$ for Row 3.

### 2. "From the mini-batch" (The Columns)

So, if we are looking at Unit 1 (Row 1), where do the numbers `12, 14, 14, 12` actually come from?

- They come from the **4 examples in your mini-batch** (the columns).
- To get the mean for Unit 1 ($\mu = 13$), you are adding up the output of Unit 1 _across the 4 examples in the current batch_.

This is why we say we are "normalizing based on the batch data." The average of Unit 1 is strictly dependent on which 4 images happen to be in this specific batch. If the network grabs 4 completely different images for the next batch, the numbers in Row 1 will change, and the mean for Unit 1 will change.

### Why this explains the Training vs. Testing difference

During **Training**, we have a guaranteed mini-batch size (e.g., $m=4$). This gives us 4 columns, which provides enough numbers to calculate an average and a variance for each row.

During **Testing**, imagine a user uploads just **one single image** for the network to evaluate. Your matrix is no longer $3 \times 4$. It is now $3 \times 1$:

$$\begin{bmatrix} 10 \\ 5 \\ 2 \end{bmatrix}$$

Now, look at Unit 1 (Row 1). It only has a single number (`10`). You cannot calculate a variance for a single number! The math completely breaks.

Furthermore, even if the user uploaded 4 test images at once, you wouldn't want to calculate the mean across them. If you did, the network's prediction for Image 1 would change depending on whether Image 2 was a dog or a cat.

**The Solution:** Because the testing batch size could be 1, or because we don't want test images interfering with each other, we completely stop calculating the mean and variance across the columns. Instead, for Unit 1, we just plug in the **global running average** that Unit 1 saw across all the thousands of batches it processed during the training phase.

## Part 4

Describe how the batch size during testing affects testing results.

---

This question is actually a direct follow-up to the previous one, and it tests whether you truly understand the "fixed" nature of the testing phase we just talked about. Here is the step-by-step breakdown of how to think about this question and how to formulate your answer.

### 1. The Direct Answer

The short and most important answer is: **The batch size during testing has absolutely no mathematical effect on the testing results (predictions) for any individual sample.**

Whether you pass $1$ image through the network, or a batch of $100$ images at the exact same time, the prediction for Image #1 will be exactly the same.

### 2. The Reason (Why doesn't it matter?)

To understand why, we have to look back at the golden rule of the Testing Phase from your professor's notes: _"At test time BatchNorm layer functions differently: The mean/std are not computed based on the batch. Instead, a single fixed empirical mean of activations during training is used"_.

Think about the math formula for Batch Normalization:

$$\hat{z}_i = \frac{z_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

Because $\mu$, $\sigma^2$, $\gamma$, and $\beta$ are all locked, fixed numbers during testing, the math applied to a single input vector $z_i$ is completely isolated. The network doesn't look at the rest of the batch to calculate anything, so the size of the batch is mathematically irrelevant to the output!

3. The "What If" Scenario (Why must it be this way?)

To solidify this for your understanding, imagine what would happen if the testing batch size did affect the results.

Let's go back to our hospital X-Ray example: if the network computed its statistics from the test batch, Patient A's diagnosis would change depending on which other patients' X-Rays happened to be uploaded alongside it.

This violates a fundamental rule of machine learning: a model's prediction for a specific data point must depend only on that data point. By using fixed training statistics, we guarantee that the testing batch size has zero influence on the prediction.


Final Answer

Here is a clear and professional way you can summarize this for your assignment:

"The batch size during testing does not affect the testing results or predictions for any individual sample. During the testing phase, the Batch Normalization layer no longer computes the mean and variance from the incoming batch data. Instead, it uses the fixed running averages of the mean and variance that were pre-computed and saved during the training phase. Because the normalization parameters ($\mu$, $\sigma^2$, $\gamma$, $\beta$) are completely fixed during testing, each sample in a test batch is processed completely independently. Therefore, whether a sample is passed through the network in a batch size of 1 or a batch size of 100, the mathematical output for that specific sample will remain exactly the same."
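The train/test difference and the batch-size independence can both be demonstrated with a toy batch norm in numpy. This is a minimal sketch, not the real `torch.nn.BatchNorm1d`; the class name, the momentum-style running average, and the momentum value are my own assumptions:

```python
import numpy as np

class ToyBatchNorm:
    """Toy per-unit batch norm: batch statistics in training, fixed running averages in testing."""
    def __init__(self, num_units, momentum=0.1):
        self.gamma = np.ones((num_units, 1))    # learned scale (fixed after training)
        self.beta = np.zeros((num_units, 1))    # learned shift (fixed after training)
        self.run_mu = np.zeros((num_units, 1))  # running average of batch means
        self.run_var = np.ones((num_units, 1))  # running average of batch variances
        self.momentum = momentum

    def forward(self, Z, training):
        if training:
            mu = Z.mean(axis=1, keepdims=True)   # statistics of THIS mini-batch
            var = Z.var(axis=1, keepdims=True)
            self.run_mu = (1 - self.momentum) * self.run_mu + self.momentum * mu
            self.run_var = (1 - self.momentum) * self.run_var + self.momentum * var
        else:
            mu, var = self.run_mu, self.run_var  # fixed statistics saved from training
        return self.gamma * (Z - mu) / np.sqrt(var + 1e-5) + self.beta

bn = ToyBatchNorm(3)
bn.forward(np.random.randn(3, 4), training=True)   # training step updates running stats

x = np.random.randn(3, 1)                          # one test sample
alone = bn.forward(x, training=False)
in_batch = bn.forward(np.hstack([x, np.random.randn(3, 9)]), training=False)[:, :1]
print(np.allclose(alone, in_batch))  # → True: test batch size does not change the output
```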

4. LeNet for Image Recognition

In this coding assignment, you will need to complete the implementation of LeNet (LeCun Network) using Pytorch and apply the LeNet to the image recognition task on Cifar-10 (10-class classification). The Cifar-10 dataset is available here (https://www.cs.toronto.edu/~kriz/cifar.html). In addition, you will need to install the python packages “tqdm” and “pytorch”. The installation guide for PyTorch is in “readme.txt”. Please read carefully and follow the instructions. You are expected to implement your solution based on the given code. The only file you need to modify is the “solution.py” file. You can test your solution by running the “main.py” file.

Part 1

Download and extract the Cifar10 Dataset from the link above. Put the data folder “cifar-10-batches-py” in the same directory as “code”. Read the instructions carefully and then complete the function load_data().


1. Understanding the CIFAR-10 Data Format

Before we write the code, we need to know how the CIFAR-10 creators saved the dataset. According to the dataset documentation in your sources:

2. The Reshape Math (Why 3072 -> 3 x 32 x 32?)

A standard color image has a Height, a Width, and 3 Color Channels (Red, Green, Blue).

The creators of the dataset "flattened" these 3D images into a single 1D row of 3072 numbers to make them easier to save.

To feed this data into a Convolutional Neural Network (LeNet), we cannot use a flat row. We must restore it to its 3D box shape: [Channels (3), Height (32), Width (32)].
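A quick sketch of what that reshape recovers, using a synthetic flat row (np.arange stands in for real pixel data; the stored order, per the CIFAR-10 documentation, is 1024 red values, then 1024 green, then 1024 blue):

```python
import numpy as np

# A fake flattened CIFAR-10 row: 3072 values in the stored order
# (first 1024 = red channel, next 1024 = green, last 1024 = blue).
flat = np.arange(3072)

img = flat.reshape(3, 32, 32)   # restore [Channels, Height, Width]

assert img.shape == (3, 32, 32)
assert img[0, 0, 0] == flat[0]      # red channel starts at index 0
assert img[1, 0, 0] == flat[1024]   # green channel starts at index 1024
assert img[2, 0, 0] == flat[2048]   # blue channel starts at index 2048
```

Because NumPy reshapes in row-major order, each contiguous block of 1024 values becomes one 32 × 32 channel, which is exactly the layout the convolutional layers expect.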


3. Step-by-Step Implementation

Here is exactly how we will build your load_data function:

  1. Unpickling Helper: We will write a tiny function inside load_data to open the pickled files and read the dictionaries.
  2. Load Training Data: We will loop through numbers 1 to 5 to open data_batch_1 through data_batch_5, extracting the pixels and labels from each and appending them to a list.
  3. Combine Arrays: We will use numpy to stitch those 5 separate batches together into one giant array of 50,000 images and 50,000 labels.
  4. Load Testing Data: We will open test_batch to get our 10,000 testing images and labels.
  5. Reshape & Type Cast: We will use the numpy .reshape() command to convert the flat 3072 arrays into the (3, 32, 32) format. We will also force the data types to match your assignment's requirements (uint8 for images, int64 for labels).

The Code

Here is the complete code you can copy and paste into your solution.py file inside the load_data(data_dir): block. Make sure you import os, pickle, and numpy at the very top of your solution.py file!

import os
import pickle
import numpy as np

def load_data(data_dir):
    '''
    To load the Cifar-10 Dataset from files and reshape the
    images arrays from shape [N, 3072] to shape [N, 3, 32, 32].
    ...
    '''

    ### YOUR CODE HERE

    # 1. Helper function to unpickle the files
    def unpickle(file):
        with open(file, 'rb') as fo:
            # encoding='bytes' ensures Python 3 reads the byte strings properly
            batch_dict = pickle.load(fo, encoding='bytes')
        return batch_dict

    # 2. Initialize empty lists to hold our batches of data
    x_train_list = []
    y_train_list = []

    # 3. Loop through training batches 1 to 5
    for i in range(1, 6):
        # Construct the file path (e.g., '../cifar-10-batches-py/data_batch_1')
        filename = os.path.join(data_dir, f'data_batch_{i}')

        # Unpickle the dictionary
        batch_dict = unpickle(filename)

        # The dictionary keys are stored as byte strings (b'data' and b'labels')
        x_train_list.append(batch_dict[b'data'])
        y_train_list += batch_dict[b'labels']  # use += to combine the lists

    # 4. Combine the 5 lists into single numpy arrays
    # vstack stacks the 5 (10000, 3072) arrays into one (50000, 3072) array
    x_train = np.vstack(x_train_list)
    # Convert the combined labels list into a numpy array of type int64
    y_train = np.array(y_train_list, dtype=np.int64)

    # 5. Load the test batch
    test_filename = os.path.join(data_dir, 'test_batch')
    test_dict = unpickle(test_filename)

    x_test = test_dict[b'data']
    y_test = np.array(test_dict[b'labels'], dtype=np.int64)

    # 6. Reshape the image data!
    # -1 tells numpy to automatically figure out the first dimension (50000 or 10000)
    # 3, 32, 32 separates the flat 3072 row into 3 channels of 32x32 grids.
    # .astype(np.uint8) ensures the pixel values are standard 8-bit unsigned integers.
    x_train = x_train.reshape(-1, 3, 32, 32).astype(np.uint8)
    x_test = x_test.reshape(-1, 3, 32, 32).astype(np.uint8)

    ### END YOUR CODE

    return x_train, y_train, x_test, y_test

A quick note on why b'data' instead of 'data'

Because the CIFAR-10 dataset is a bit older and was originally pickled in Python 2, loading it in modern Python 3 causes the string keys in the dictionary to be read as "byte strings". In Python, you write a byte string by putting a b in front of the quotes. So, we ask the dictionary for b'data' instead of 'data'.
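A tiny self-contained illustration of the byte-string keys (the dict here just mimics what pickle.load with encoding='bytes' returns; the values are made up):

```python
# Simulate the kind of dictionary CIFAR-10's Python-2-era pickles produce:
batch_dict = {b'data': [1, 2, 3], b'labels': [0, 1, 0]}

assert b'data' in batch_dict      # the byte-string key works...
assert 'data' not in batch_dict   # ...but the plain string key does not
```

In Python 3, b'data' and 'data' are different objects (bytes vs. str), so they are different dictionary keys.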

Let me know when you get this successfully running in your main.py and are ready to tackle the LeNet model itself!

Running the code:

> python3 main.py
Loading and preprocessing...
/Users/macc/TAMU/Spring_26/CSCE-421/HW3/code/solution.py:37: VisibleDeprecationWarning: dtype(): align should be passed as Python or NumPy boolean but got `align=0`. Did you mean to pass a tuple to create a subarray type? (Deprecated NumPy 2.4)
  batch_dict = pickle.load(fo, encoding='bytes')

Why is this happening? The message you are seeing is a VisibleDeprecationWarning, which is just a heads-up from the system, not an actual error that stops your code. The CIFAR-10 dataset files were originally created and saved (using Python's cPickle module) many years ago using a much older version of the NumPy library. You are currently using a very modern version of NumPy (version 2.4).

When your modern NumPy reads the old pickled file using pickle.load(fo, encoding='bytes'), it notices that the arrays were saved using an outdated formatting parameter (align=0). It prints this warning to let you know the format is deprecated, but it is still perfectly capable of reading the data. The data arrays loaded into your computer's memory are exactly what they need to be.

You can safely ignore this warning and move right along! Let me know what the next part of your assignment asks you to do to start building the LeNet model.

Part 2

Complete the function preprocessing(), you need to implement two ways of preprocessing the data: (1) rescaling: rescales the image pixels from range 0-255 to range 0-1; (2) normalization: centralize and rescale each image using its mean and variance. There is a parameter normalize in the preprocessing() to control whether to rescale or normalize the images.


Data preprocessing is a crucial step because neural networks learn much faster and more stably when the input numbers are small and on a standardized scale.

1. The Goal of the Function

The docstring tells us we have two arrays: train_images and test_images. Currently, they contain pixel values ranging from 0 to 255, and their data type is uint8 (8-bit integers). The docstring also specifies that our function must return these arrays as float64 (decimals). Therefore, the very first thing we must do is convert the data type of both arrays.

2. Method 1: Rescaling (normalize=False)

If the user passes normalize=False, we just need to squash the 0 to 255 range down to a 0 to 1 range. Because the maximum pixel value is 255, we can achieve this perfectly by simply dividing every single number in the arrays by 255.0.
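For instance (a throwaway sanity check, not part of the assignment code):

```python
import numpy as np

pixels = np.array([0, 51, 255], dtype=np.uint8)

scaled = pixels.astype(np.float64) / 255.0   # cast first, then divide

assert scaled.min() == 0.0 and scaled.max() == 1.0
assert np.isclose(scaled[1], 0.2)            # 51 / 255 = 0.2
```

Casting to float before dividing matters: dividing uint8 arrays in place would truncate everything to 0.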

3. Method 2: Normalization (normalize=True)

If the user passes normalize=True, we have to do something a bit more mathematically complex. We need to centralize the data (give it a mean of 0) and rescale it using variance (give it a spread of 1).

To do this, we must remember a critical rule from your notes:

The Python Implementation

Here is the code you can copy and paste into your solution.py file to complete the preprocess function:

import numpy as np

def preprocess(train_images, test_images, normalize=False):
    '''
    To preprocess the data by
        (1).Rescaling the pixels from integers in [0,255) to
            floats in [0,1), or
        (2).Normalizing each image using its mean and variance.
    ...
    '''
    ### YOUR CODE HERE

    # 1. Cast the arrays to float64 as required by the docstring's Returns section
    train_images = train_images.astype(np.float64)
    test_images = test_images.astype(np.float64)

    if not normalize:
        # (1) Rescaling: Squeeze the range down to [0, 1]
        train_images = train_images / 255.0
        test_images = test_images / 255.0

    else:
        # (2) Normalization: Centralize (subtract mean) and rescale (divide by std)

        # Calculate the mean and standard deviation using ONLY the training data.
        # axis=(0, 2, 3) calculates these statistics per color channel (R, G, B)
        # across all images (axis 0), heights (axis 2), and widths (axis 3).
        # keepdims=True ensures the shape aligns perfectly for subtraction/division later.
        train_mean = np.mean(train_images, axis=(0, 2, 3), keepdims=True)
        train_std = np.std(train_images, axis=(0, 2, 3), keepdims=True)

        # Prevent division by zero just in case a channel is completely blank
        train_std = np.maximum(train_std, 1e-7)

        # Apply the EXACT SAME training statistics to normalize BOTH datasets
        train_images = (train_images - train_mean) / train_std
        test_images = (test_images - train_mean) / train_std

    ### END YOUR CODE

    return train_images, test_images

A Quick Note on the Math

You might notice the prompt asks to rescale using "variance", but the code uses np.std (Standard Deviation). In statistics and deep learning, the standard formula to normalize data to a unit variance is to divide by the standard deviation (which is simply the square root of variance). Dividing directly by the raw variance is mathematically incorrect for this goal, so np.std is the proper NumPy tool to use here!
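A quick numerical check of that claim (standalone, not part of the solution):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
centered = x - x.mean()

by_std = centered / x.std()   # correct: yields unit variance
by_var = centered / x.var()   # incorrect for this goal

assert np.isclose(by_std.var(), 1.0)       # variance becomes exactly 1
assert not np.isclose(by_var.var(), 1.0)   # dividing by variance does not
```

Here x has variance 5, so dividing by the variance shrinks the spread to 5/25 = 0.2 instead of 1.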

Let me know if the axis=(0, 2, 3) part makes sense or if you are ready to move on to building the actual LeNet model!

Part 3

Complete the class LeNet().
In particular, you need to complete functions __init__() and forward() in the class. The paper for LeNet can be found here (https://ieeexplore.ieee.org/document/726791). The network architecture is shown in the figure below (LeNet.png).

The subsampling is implemented using max pooling, and the kernel size for all the convolutional layers is 5 × 5. The sequential layers are:

Inputs → Convolution (6 out channels) → BN → ReLU → Max Pooling → Convolution (16 out channels) → BN → ReLU → Max Pooling → Reshape to vector → Fully-connected (120 out units) → BN → ReLU → Fully-connected (84 out units) → BN → ReLU → Dropout → Outputs (n_classes out units)

For this part, you are only allowed to use the APIs in torch.nn. Please refer to the PyTorch API documents below for the usage of those APIs before you use them: https://pytorch.org/docs/stable/nn.html


Building your first Neural Network from scratch is a major milestone! To implement the LeNet class, we need to map your professor's exact layer sequence to PyTorch functions inside the __init__ method, and then pass the data through them in the forward method.

Before we write the code, we must solve one critical mathematical puzzle: What is the size of the vector when we "Reshape to vector"?

1. The Math: Tracking the Tensor Shape

Just like we did in Question 1 of your homework, we need to track the size of our images as they pass through the layers using the formula: Output = (N − F) / Stride + 1 (there is no padding here). Starting from the 32 × 32 input: conv1 (5 × 5) gives (32 − 5)/1 + 1 = 28, max pooling (2 × 2, stride 2) halves it to 14, conv2 (5 × 5) gives (14 − 5)/1 + 1 = 10, and the second pooling halves it to 5.

The Reshape Step: At this point, we have 16 feature maps, each 5×5 in size. To pass this into a Fully-Connected (Linear) layer, we must flatten it into a single 1D vector.
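The shape bookkeeping can be double-checked with a few lines of Python (conv_out and pool_out are hypothetical helper names, not part of the assignment skeleton):

```python
def conv_out(n, f, stride=1, pad=0):
    """Spatial output size of a convolution: (N - F + 2P) / S + 1."""
    return (n + 2 * pad - f) // stride + 1

def pool_out(n, k=2, stride=2):
    """Spatial output size of max pooling."""
    return (n - k) // stride + 1

n = 32                      # CIFAR-10 input is 32 x 32
n = conv_out(n, 5)          # conv1 (5x5): 32 -> 28
n = pool_out(n)             # pool1 (2x2): 28 -> 14
n = conv_out(n, 5)          # conv2 (5x5): 14 -> 10
n = pool_out(n)             # pool2 (2x2): 10 -> 5

flat_features = 16 * n * n  # 16 channels x 5 x 5
assert flat_features == 400
```

This 400 is exactly the in_features value the first fully-connected layer needs.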


2. Implementing __init__() (Defining the Layers)

In the __init__ function, we just initialize the tools (layers) we are going to use.

3. Implementing forward(x) (Passing the Data)

In the forward function, we take the input image x and pass it sequentially through the tools we defined in __init__.

To do the "Reshape to vector" step, we will use the PyTorch command x = x.view(x.size(0), -1).


The Final Code

Here is the fully completed solution.py class implementation. You can replace your LeNet skeleton with this:

import torch
import torch.nn as nn

class LeNet(nn.Module):
    '''
    Build the LeCun network according to the architecture in the homework part 4(c)
    '''
    def __init__(self, n_classes=None):
        super(LeNet, self).__init__()

        ### YOUR CODE HERE

        # --- BLOCK 1: Convolutional Layers ---
        # 1. Convolution (3 in channels for RGB -> 6 out channels, 5x5 kernel)
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)
        self.bn1 = nn.BatchNorm2d(6)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)

        # 2. Convolution (6 in channels -> 16 out channels, 5x5 kernel)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5)
        self.bn2 = nn.BatchNorm2d(16)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)

        # --- BLOCK 2: Fully-Connected Layers ---
        # 3. Fully-connected (16*5*5 = 400 inputs -> 120 out units)
        self.fc1 = nn.Linear(in_features=16*5*5, out_features=120)
        self.bn3 = nn.BatchNorm1d(120)
        self.relu3 = nn.ReLU()

        # 4. Fully-connected (120 inputs -> 84 out units)
        self.fc2 = nn.Linear(in_features=120, out_features=84)
        self.bn4 = nn.BatchNorm1d(84)
        self.relu4 = nn.ReLU()
        self.dropout = nn.Dropout(p=0.5) # 0.5 is the standard dropout probability

        # 5. Outputs (84 inputs -> n_classes out units)
        self.fc3 = nn.Linear(in_features=84, out_features=n_classes)

        ### END CODE HERE

    def forward(self, x):
        '''
        Run forward pass of the model defined in the above __init__() function
        '''
        ### YOUR CODE HERE

        # Pass through Block 1
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu1(x)
        x = self.pool1(x)

        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu2(x)
        x = self.pool2(x)

        # Reshape to vector (Flatten the 16x5x5 feature maps into 400-length vectors)
        # x.size(0) keeps the batch dimension separate.
        x = x.view(x.size(0), -1)

        # Pass through Block 2
        x = self.fc1(x)
        x = self.bn3(x)
        x = self.relu3(x)

        x = self.fc2(x)
        x = self.bn4(x)
        x = self.relu4(x)
        x = self.dropout(x)

        # Final Output (Logits)
        logits = self.fc3(x)

        ### END CODE HERE

        return logits

One last detail: Notice that the very last layer fc3 just returns the raw, un-normalized values (called "logits"). It does not apply the Softmax activation function. Why? Because in PyTorch, the standard loss function for multi-class classification (CrossEntropyLoss) actually calculates the Softmax math internally for better numerical stability! Therefore, we just return the raw logits.
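To see what CrossEntropyLoss folds together, here is the same math sketched in plain NumPy for a single sample (a conceptual illustration, not PyTorch internals; the logit values are made up):

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])   # raw fc3 outputs for one sample
target = 0                            # true class index

# CrossEntropyLoss = log-softmax + negative log-likelihood, fused.
# Subtracting the max first is the numerical-stability trick.
shifted = logits - logits.max()
log_softmax = shifted - np.log(np.exp(shifted).sum())
loss = -log_softmax[target]

# Equivalent "naive" version: softmax first, then -log of the target prob.
probs = np.exp(logits) / np.exp(logits).sum()
assert np.isclose(loss, -np.log(probs[target]))
```

Both paths give the same loss, but the fused, shifted version avoids overflowing np.exp on large logits, which is why the loss function wants raw logits rather than pre-softmaxed probabilities.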

Part 4

Try to read and understand the class LeNet_Cifar10(). Run the main.py to train and test the model. You need to train two models: one using normalization preprocessing and the default LeNet architecture as defined in part (c), and another using rescaling preprocessing and LeNet without Dropout and Batch Normalization. Compare the results of the two models and analyze why there is an improvement in (or why it harms) the performance. Include your training log and test accuracy, as well as a short analysis of the results in your report. The training and testing may take 10-30 minutes to finish on a CPU (~20 epochs).


Congratulations on making it to the final task! You have built the components, and now it is time to put them all together, run the training loop, and see the deep learning magic happen.

I will first explain what the LeNet_Cifar10 class is doing so you understand the code, then guide you on how to run both models, and finally provide the detailed theoretical analysis you need for your report based on your professor's notes.

1. Understanding the LeNet_Cifar10 Class

This class is a "wrapper" that manages the actual training process. Here is what it does step-by-step:


2. How to Run Model 1 (The Default Model)

This is your "good" model with all the advanced deep learning tricks.

  1. Open main.py and make sure the preprocess line is set to use Normalization: x_train, x_test = preprocess(x_train, x_test, normalize=True)
  2. Make sure your LeNet class in solution.py is exactly as we wrote it earlier (with all the BatchNorm and Dropout layers).
  3. Run python3 main.py.
  4. Action Required: Copy the output printed in your terminal (the Epoch losses and Validation Accuracies) and the final Test Accuracy. Save this in your report!

3. How to Run Model 2 (The Stripped-Down Model)

This is the "barebones" model that relies purely on convolutions and linear layers without the modern tricks.

  1. Open main.py and change the preprocess line to use Rescaling instead: x_train, x_test = preprocess(x_train, x_test, normalize=False)

  2. Open solution.py and comment out the Batch Normalization and Dropout layers in your LeNet class. Your modified forward function should look like this:

    def forward(self, x):
        # Block 1
        x = self.conv1(x)
        # x = self.bn1(x)  <-- COMMENTED OUT
        x = self.relu1(x)
        x = self.pool1(x)
    
        x = self.conv2(x)
        # x = self.bn2(x)  <-- COMMENTED OUT
        x = self.relu2(x)
        x = self.pool2(x)
    
        x = x.view(x.size(0), -1)
    
        # Block 2
        x = self.fc1(x)
        # x = self.bn3(x)  <-- COMMENTED OUT
        x = self.relu3(x)
    
        x = self.fc2(x)
        # x = self.bn4(x)  <-- COMMENTED OUT
        x = self.relu4(x)
        # x = self.dropout(x) <-- COMMENTED OUT
    
        logits = self.fc3(x)
        return logits
    
  3. Run python3 main.py again.

  4. Action Required: Save the new training log and final test accuracy for your report.


4. Analysis for your Report (Why Model 1 performs better)

Once your code finishes running, you will notice that Model 1 converges faster, achieves higher validation accuracy, and has a much better test accuracy than Model 2.

Here is the detailed analysis you can adapt for your report to explain exactly why this happens, drawing directly from your course concepts:

1. The Impact of Data Normalization vs. Rescaling In Model 2, the data was only rescaled to a range of 0 to 1. In Model 1, the data was fully normalized using the training set's mean and variance. As noted in the course, un-normalized inputs cause large features to dominate the weight updates, while small features get drowned out. By forcing the input data to have a zero-mean and unit-variance, Model 1 ensures that all weights update at a "similar pace", preventing the gradient descent from oscillating wildly and allowing the network to learn stable patterns right from the first layer.

2. The Impact of Batch Normalization Model 2 lacks Batch Normalization. As data passes through multiple layers of convolutions and ReLUs, the scale of the activations can shift dramatically, making it very hard for the optimizer to find a good path down the error surface. By adding Batch Normalization in Model 1, we force the activations at every layer to maintain a zero-mean and unit-variance. This "improves gradient flow through the network," reduces the model's dependence on lucky weight initialization, and allows the Adam optimizer to take confident, stable steps. Without it, Model 2 likely struggled to optimize its weights efficiently.

3. The Impact of Dropout (Combating Overfitting) Model 2 has no regularization, meaning it is highly susceptible to overfitting. Because the network has thousands of parameters, it likely started to memorize the training data (fitting the "stochastic and deterministic noise") rather than learning generalizable features. Model 1 utilizes Dropout (with a probability of 0.5) on its fully connected layer. By randomly turning off 50% of the neurons during every training pass, Dropout prevents neurons from "co-adapting" and forces the network to learn robust, redundant representations of the images. This acts as a massive ensemble of different networks, heavily regularizing the model and ensuring that the high training accuracy actually translates into high test accuracy.
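The mechanism is easy to sketch: below is a minimal NumPy-only version of inverted dropout (a conceptual sketch, not the nn.Dropout source):

```python
import numpy as np

def dropout_train(x, p=0.5, rng=np.random.default_rng(0)):
    """Inverted dropout: zero each unit with prob p, scale survivors by 1/(1-p)."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.ones(10_000)
out = dropout_train(x)

# Training pass: roughly half the units are silenced...
assert np.isclose((out == 0).mean(), 0.5, atol=0.05)

# ...but the 1/(1-p) scaling keeps the expected activation unchanged,
# so at test time dropout can simply become the identity function.
assert np.isclose(out.mean(), 1.0, atol=0.05)
```

The 1/(1-p) rescaling is what lets model.eval() turn dropout off entirely without changing the expected magnitude of the activations.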

Conclusion Summary for Report: Model 2 suffers from slower/unstable optimization (due to lack of input and batch normalization) and poor generalization (due to lack of Dropout regularization). Model 1 utilizes normalization to create a smooth, efficient gradient descent path, and relies on Dropout to combat overfitting, ultimately yielding a vastly superior out-of-sample (Test) accuracy.


Looking at the results, Model 1 definitely outperformed the stripped-down Model 2, scoring a test accuracy of about 65.9% compared to 61.8%, while also hitting a much lower training loss. This makes a lot of sense when you think about the deep learning tricks we added. First off, using full data normalization gives the inputs a zero-mean and unit-variance, which helps all the weights update at a similar pace instead of having small weights oscillate while large weights dominate the updates. Then we have Batch Normalization, which is a game-changer because it improves the gradient flow through the network and makes the whole optimization process much more stable. That perfectly explains why our training loss dropped so much lower. On top of that, Model 1 used Dropout to randomly shut off neurons during training, preventing them from co-adapting. This basically acts like training a huge ensemble of models, forcing the network to learn redundant, robust features instead of just memorizing the training data. So, between the smoother learning path from the normalization steps and the heavy regularization from Dropout, it's no wonder Model 1 generalized so much better to the unseen test images!

Review (Answers)

Question 1

Output size = ((input size + 2 × padding − kernel size) / stride) + 1 = ((15 + 2 − 3) / 1) + 1 = 15, so the output volume is 15 × 15 × 28.
Number of parameters = 3 × 3 × 3 × 28 + 28 = 756 + 28 = 784 (weights plus one bias per filter).
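Both quantities can be verified mechanically (plain arithmetic, no framework required):

```python
n, f, pad, stride, filters, depth = 15, 3, 1, 1, 28, 3

out = (n + 2 * pad - f) // stride + 1          # spatial output size
params = filters * (f * f * depth) + filters   # weights + one bias per filter

assert out == 15        # output volume: 15 x 15 x 28
assert params == 784    # 756 weights + 28 biases
```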

Question 2

Total trainable parameters = 294,912 + 589,824 + 32,768 = 917,504

Question 3

μ = (1/4)(z1 + z2 + z3 + z4) = (1/4)[12 + 14 + 14 + 12, 0 + 10 + 10 + 0, 5 − 5 + 5 − 5] = [13, 5, 0]
σ² = (1/4)([1, 25, 25] + [1, 25, 25] + [1, 25, 25] + [1, 25, 25]) = [1, 25, 25]  (each deviation zi − μ has entries ±1, ±5, ±5, so each squares elementwise to [1, 25, 25])
ẑi = (zi − μ) / √σ², so every entry of each ẑi is ±1
yi = γ ⊙ ẑi + β  (elementwise scale and shift applied to each normalized vector)

Question 4

You should be able to get a test accuracy of ~65% on Cifar-10