HW3 - Convolutional Networks

Class: CSCE-421


Notes:

Question 1

A single (15 × 15 × 3) image is passed through a convolutional layer with 28 filters, each of size (3 × 3 × 3). The padding size is 1 (1 unit at top, bottom, left, and right) and the stride size is also 1. What is the size of the output feature map volume? What is the number of parameters in this layer (including bias)? Note that, for simplicity, we consider filters of 3 × 3 × 3 as one filter, instead of three filters of size 3 × 3.


What is a Convolution?

To understand how Convolutional Neural Networks (CNNs) work from scratch, we first need to understand how a computer sees an image and how we can extract patterns from it.

To a computer, an image is simply a two-dimensional array of numbers. In a CNN, we use something called a filter (sometimes called a kernel), which is a much smaller box of numbers. The core idea of a convolution is to take this small filter and slide it across the input image. At every step, the network performs an element-wise multiplication between the numbers in the filter and the numbers in that specific patch of the image, and then sums them all up to produce a single number.

These filters are designed to detect important visual features, like vertical or horizontal edges, which combine to form shapes and objects. The numbers inside these filters are the parameters that the neural network actually learns from the data during training.

Part 1: What is the size of the output feature map volume?

To find the volume of the output, we need to calculate its spatial dimensions (Height and Width) and its depth (Number of output channels/slices).

You are given a single image of size:

15×15×3

This means the spatial dimensions are 15 × 15, and the depth is 3 (the RGB color channels).

The layer has 28 filters, each of size:

3×3×3

That means each filter covers a 3 × 3 spatial window and extends through all 3 input channels at once.

This is important:

Spatial vs. Depth Dimensions

| Dimension | Filter Behavior | Purpose |
|---|---|---|
| Width (W) | Slides (Stride) | Locates features horizontally. |
| Height (H) | Slides (Stride) | Locates features vertically. |
| Depth (D) | Fixed (Full Depth) | Combines channel data into a new feature. |

1. Spatial Dimensions (Height and Width)

When you slide a filter over an image, the output size naturally shrinks, and the pixels on the extreme borders are not treated as fairly as the pixels in the center. To prevent the image from shrinking too quickly, we use padding, which means we artificially add a border of zero-value pixels around the outside of the input image.

Your original image is 15×15. Because the problem states there is a padding of 1, you are adding 1 pixel to the top, 1 to the bottom, 1 to the left, and 1 to the right. This makes your "effective" input size 15+1+1=17 for both height and width.

The problem also mentions a stride of 1. Stride simply dictates how many steps the filter takes when it slides across the image.

To calculate the exact size of the output, your professor provided this formula:

$$\text{Output Size} = \frac{\text{Input Size} - \text{Filter Size}}{\text{Stride}} + 1$$

Let's plug your numbers into this formula for both height and width:

$$\text{Output Dimension} = \frac{17 - 3}{1} + 1 = 15$$

Notice how using a padding of 1 with a 3×3 filter perfectly preserved your original 15×15 image size.
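The formula above is easy to sanity-check in code. This is a minimal sketch; `conv_output_size` is a hypothetical helper name, not from the assignment:

```python
def conv_output_size(input_size, filter_size, padding, stride):
    """Spatial output size of a convolution (floor division for non-even fits)."""
    return (input_size + 2 * padding - filter_size) // stride + 1

# The layer from Question 1: 15x15 input, 3x3 filter, padding 1, stride 1
print(conv_output_size(15, 3, 1, 1))  # → 15
```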

2. Depth (Number of Channels)

Every time you slide a single filter across the entire image, it generates one completely independent 2D output slice. Because your layer uses 28 filters, the network will generate 28 completely independent output slices and stack them together.

In other words: Each filter produces one feature map. Since there are 28 filters, the output depth will be: 28

Final Answer for Part 1: Combining the height, width, and depth, the final output feature map volume is

15×15×28

Part 2: What is the number of parameters in this layer (including bias)?

A "parameter" is a specific weight or number that the network has to learn. To find the total number of parameters in this layer, we need to calculate the parameters for just one filter, and then multiply that by the total number of filters.

1. Parameters in a single filter Color images are not just flat grids; they have depth because they are made of 3 color channels (Red, Green, Blue). Therefore, a filter must also have depth so it can connect to every input channel at the same time. This is why your filter size is given as 3×3×3 (Height × Width × Input Channels).

To find the number of weights in one filter, you multiply those dimensions: 3 × 3 × 3 = 27 weights.

Additionally, every single filter in a neural network gets exactly 1 bias parameter added to it, which acts as a threshold. That brings each filter to 27 + 1 = 28 parameters.

2. Total parameters in the layer You have 28 of these filters in total. Because each output slice is generated completely independently, every single filter has its own unique set of parameters. For this convolutional layer we have: 28 parameters per filter × 28 filters = 784.

Final Answer for Part 2: There are 784 parameters in this convolutional layer.

Question 2

In this question, you can (a) assume padding of appropriate size is used in convolutional layers, and (b) ignore batch normalization. Given the residual block as below:

(Visual aid: residual block diagram — image.png)

1. Skip connection

What projection shortcut operations are required on the skip connection?


This question introduces one of the most famous breakthroughs in modern deep learning: the Residual Network (ResNet).

The Basics: Feature Maps and Stride

What is a Residual Block and a Skip Connection?

In traditional neural networks, data flows straight down a single path, layer by layer. However, researchers found that if you make a network too deep (e.g., 56 layers), the performance actually gets worse because it becomes too mathematically difficult to optimize.

To fix this, researchers invented the Residual Block. Instead of forcing the network to learn a completely new transformation from scratch at every layer, they added a second path called a skip connection (the long arrow going down the right side of your diagram).

At the very bottom of the block, you hit the "add" operation. The network takes the newly transformed data from the left path and adds it element-by-element to the original data from the right path.

The Problem: The "Add" Crash

Here is the golden rule of residual blocks: To add two volumes of data together, they must have the exact same spatial size (height and width) and the exact same number of feature maps (depth).

Let's track the data through your specific diagram to see why this rule creates a massive problem:

  - The input to the block (which the skip connection copies) is a 32 × 32 × 128 volume.
  - The main path's convolutions downsample the spatial size by a stride of 2 and expand the depth to 256 filters, producing a 16 × 16 × 256 volume.

The Crash: At the "add" node, the network tries to add a 16 × 16 × 256 block (from the left) to a 32 × 32 × 128 block (from the right). Because the dimensions do not match at all, the math crashes.

The Solution: The Projection Shortcut

To fix this crash, we cannot just blindly copy the input down the skip connection. We have to apply a quick operation to the skip connection to force its dimensions to perfectly match the main path. This is what we call a "projection shortcut."

We need to fix two things on the skip connection:

  1. Fixing the Depth: We need to increase the number of feature maps from 128 to 256. To do this without messing up the actual image patterns, we use a 1x1 convolution with 256 filters. A 1x1 convolution is specifically used to change the number of feature maps.
  2. Fixing the Spatial Size: We need to shrink the height and width by a factor of 2. We do this by applying a stride of 2 to that same 1x1 convolution.

By putting a 1x1 convolution with a stride of 2 on the skip connection, the right path will output a 16 × 16 × 256 block. Now, both paths match perfectly, and the "add" operation will succeed!

Mathematically

In a Residual Network, the standard formula for a block is $y = F(x) + x$. However, when the dimensions of the main path $F(x)$ and the input $x$ do not match, we must apply a linear projection $W_s$ to the shortcut so that $y = F(x) + W_s x$.

Here is a mathematical way to write your answer using LaTeX, which incorporates the tensor dimensions and the output size formula from your notes:

Let the input to the residual block be the tensor $x \in \mathbb{R}^{H \times W \times 128}$. The main convolutional path downsamples the spatial dimensions and increases the filters, yielding an output $F(x) \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 256}$.

To perform the element-wise residual addition $y = F(x) + W_s x$, the projection shortcut $W_s$ must transform $x$ to match the exact dimensions of $F(x)$: a 1 × 1 convolution with 256 filters and stride 2.
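A quick shape-level check of the shortcut, in plain Python. This is a sketch assuming the first 3×3 conv on the main path carries the stride-2 downsampling; `conv_shape` is a hypothetical helper (in PyTorch the shortcut itself would be `nn.Conv2d(128, 256, kernel_size=1, stride=2)`):

```python
def conv_shape(h, w, c_out, k, stride, pad):
    """Output (H, W, C) of a conv layer; depth becomes the filter count."""
    out = lambda s: (s + 2 * pad - k) // stride + 1
    return out(h), out(w), c_out

# Input to the residual block: 32 x 32 x 128
main = conv_shape(32, 32, 256, k=3, stride=2, pad=1)            # first 3x3 conv (downsamples)
main = conv_shape(main[0], main[1], 256, k=3, stride=1, pad=1)  # second 3x3 conv
skip = conv_shape(32, 32, 256, k=1, stride=2, pad=0)            # 1x1 projection shortcut
print(main, skip)  # → (16, 16, 256) (16, 16, 256): the "add" is now legal
```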

2. Number of trainable parameters

What is the total number of trainable parameters in the block (you can ignore bias terms, but need to consider the skip connection)?


The Core Formula for Trainable Parameters

In Question 1, we learned that the trainable parameters are the actual numbers (weights) inside the filters that the computer has to learn.

Since the problem explicitly says we can ignore the bias terms, our final mathematical formula is simply: Parameters = (Filter Height × Filter Width × Input Feature Maps) × Output Feature Maps

(Note: You will notice that stride is not in this formula. Stride only changes how the filter moves across the image; it does not change the physical size of the filter itself, so it does not affect the number of parameters!)

Now, let's apply this formula to the three distinct convolutional layers in your residual block.

Step 1: The First Convolution on the Main Path

Looking at the left side of your diagram, the data first passes through a 3×3 convolution to create 256 feature maps.

Let's plug this into our formula: (3 × 3 × 128) × 256 = 294,912 parameters.


Step 2: The Second Convolution on the Main Path

The data continues down the left side into the second 3×3 convolution.

Let's plug this into our formula: (3 × 3 × 256) × 256 = 589,824 parameters.


Step 3: The Skip Connection (Projection Shortcut)

Remember from the previous question that the skip connection (the right path) cannot just be an empty wire. Because the dimensions didn't match, we had to add a 1×1 convolution to it to increase the feature maps from 128 to 256. These 1×1 filters also contain learnable parameters!

Let's plug this into our formula: (1 × 1 × 128) × 256 = 32,768 parameters.


Step 4: The Final Total

To find the total number of trainable parameters in the entire residual block, we simply add the parameters from all three of these convolutions together: 294,912 + 589,824 + 32,768 = 917,504.

Final Answer Summary:

The total number of trainable parameters in this residual block is 917,504. We calculate this by summing the parameters of the three convolutional operations (using the formula Height × Width × Input Depth × Output Filters without bias):

  1. First 3×3 layer: (3×3×128)×256=294,912
  2. Second 3×3 layer: (3×3×256)×256=589,824
  3. Skip Connection (1×1 layer): (1×1×128)×256=32,768
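The three products above can be verified directly (the helper name `conv_params` is my own):

```python
def conv_params(k_h, k_w, c_in, c_out):
    """Weights in a conv layer, ignoring bias (as the problem allows)."""
    return k_h * k_w * c_in * c_out

main_1 = conv_params(3, 3, 128, 256)    # first 3x3 layer → 294,912
main_2 = conv_params(3, 3, 256, 256)    # second 3x3 layer → 589,824
shortcut = conv_params(1, 1, 128, 256)  # 1x1 projection → 32,768
print(main_1 + main_2 + shortcut)       # → 917504
```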

Question 3

Using batch normalization in neural networks requires computing the mean and variance of a tensor. Suppose a batch normalization layer takes vectors $z_1, z_2, \ldots, z_m$ as input, where $m$ is the mini-batch size. It computes $\hat{z}_1, \hat{z}_2, \ldots, \hat{z}_m$ according to

$$\hat{z}_i = \frac{z_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

where

$$\mu = \frac{1}{m}\sum_{i=1}^{m} z_i, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (z_i - \mu)^2.$$

It then applies a second transformation to obtain $\tilde{z}_1, \tilde{z}_2, \ldots, \tilde{z}_m$ using learned parameters $\gamma$ and $\beta$ as

$$\tilde{z}_i = \gamma \hat{z}_i + \beta.$$

In this question, you can assume that $\epsilon = 0$.

Part 1

  1. (5 points) You forward-propagate a mini-batch of m=4 examples in your network. Suppose you are at a batch normalization layer, where the immediately previous layer is a fully connected layer with 3 units. Therefore, the input to this batch normalization layer can be represented as the below matrix:
$$\begin{bmatrix} 12 & 14 & 14 & 12 \\ 0 & 10 & 10 & 0 \\ -5 & 5 & 5 & -5 \end{bmatrix}$$

What are $\hat{z}_i$? Please express your answer in a 3×4 matrix.


1. The Concepts: What is Batch Normalization?

Batch Normalization is considered one of the most important breakthroughs in Deep Learning because it allows networks to train much faster and more stably.

The Problem: As data passes through the many layers of a neural network, the scale of the numbers can get messy. One node (unit) might output values in the thousands, while another node outputs decimals. If the numbers are on completely different scales, the network struggles to learn, and the learning process can oscillate or diverge.

The Solution: To fix this, we force the outputs of the layers to be on a standard, predictable scale. Specifically, we want the data coming out of each unit to have a mean (average) of 0 and a variance (spread) of 1. Your professor’s notes describe this perfectly: "you want zero-mean unit-variance activations? just make them so".

How it works (Mini-Batches & Dimensions): The statistics are computed over the current mini-batch of examples, but separately for each dimension (unit) of the layer's output.

The Golden Rule of Batch Norm: Batch Normalization is calculated independently for each dimension (unit). This means you do not calculate the average of the whole matrix. You calculate the average and variance for Row 1 independently, then Row 2 independently, and then Row 3 independently.

2. Why Batch Normalization?

Why do we use Mini-Batches instead of the entire dataset?

To understand this, we need to quickly review how a neural network learns. The network looks at the data, makes a prediction, calculates how wrong it is (the error), and then updates its internal weights to be more accurate next time.

If you have a dataset of 1 million images, you have three choices for how to feed that data to the network: pass the entire dataset at once (Batch Gradient Descent), pass one image at a time (Stochastic Gradient Descent), or pass small chunks at a time (Mini-Batch Gradient Descent). Mini-batches are the standard choice, for two main reasons:

1. Hardware and Memory Limits (The physical reason) The most straightforward reason we cannot pass the entire dataset at once is that computers simply do not have enough memory to hold it. When a neural network processes data, it has to store all the intermediate math calculations for every single image in the computer's graphics card (GPU) memory.

2. Learning Speed (The mathematical reason) If you use Batch Gradient Descent (passing the entire dataset at once), the network will process all 1 million images, calculate the total error, and then take a single update step. This means your network spent a massive amount of computational power just to learn one single thing.

Why is Batch Normalization calculated independently for each dimension (unit)?

To understand this, we need to think about what the numbers coming out of those units actually represent.

1. The Danger of Mixed Scales
Imagine a network trying to predict heart attacks. The data passing through the network contains entirely different types of features: one unit might process a person's age (e.g., 62), while another unit processes their annual salary (e.g., 40,000).

2. The Goal: A "Similar Pace" for Everything
To fix this, we want to force every single piece of data to speak the exact same mathematical language. The goal of Batch Normalization is to guarantee that the data coming out of every single unit has a mean (average) of exactly 0 and a variance (spread) of exactly 1. This ensures that all features update at a "similar pace".

3. Why it MUST be calculated independently
If we took the entire layer of units and calculated one giant average for all of them combined, the 40,000 salary numbers would drag the average way up. If we then subtracted that giant average from the age unit, a 62-year-old would suddenly be represented by a massive negative number! The scales would still be completely ruined.

Therefore, the only way to ensure every feature is on a level playing field is to compute the empirical mean and variance independently for each dimension (unit).

3. Solving the Math Step-by-Step

Let's apply the formulas provided in your question to each row individually.

Unit 1 (Row 1)

The raw signals for the first unit across the 4 examples are:

$$z^{(1)} = [12, 14, 14, 12]$$

Step A: Calculate the Mean ($\mu$) Add them up and divide by $m=4$: $\mu_1 = \frac{12+14+14+12}{4} = 13$.

Step B: Calculate the Variance ($\sigma^2$) Subtract the mean from each number, square the result, and average them: $\sigma_1^2 = \frac{(-1)^2 + 1^2 + 1^2 + (-1)^2}{4} = 1$.

Step C: Normalize ($\hat{z}$) Subtract the mean and divide by the square root of the variance (which is the standard deviation). Note: The problem says to assume $\epsilon=0$. This gives $\hat{z}^{(1)} = \frac{[12, 14, 14, 12] - 13}{1} = [-1, 1, 1, -1]$.


Unit 2 (Row 2)

The raw signals for the second unit are:

$$z^{(2)} = [0, 10, 10, 0]$$

Step A: Calculate the Mean ($\mu$): $\mu_2 = \frac{0+10+10+0}{4} = 5$.

Step B: Calculate the Variance ($\sigma^2$): $\sigma_2^2 = \frac{(-5)^2 + 5^2 + 5^2 + (-5)^2}{4} = 25$.

Step C: Normalize ($\hat{z}$): $\hat{z}^{(2)} = \frac{[0, 10, 10, 0] - 5}{5} = [-1, 1, 1, -1]$.


Unit 3 (Row 3)

The raw signals for the third unit are: $z^{(3)} = [-5, 5, 5, -5]$.

Step A: Calculate the Mean ($\mu$): $\mu_3 = \frac{-5+5+5-5}{4} = 0$.

Step B: Calculate the Variance ($\sigma^2$): $\sigma_3^2 = \frac{(-5)^2 + 5^2 + 5^2 + (-5)^2}{4} = 25$.

Step C: Normalize ($\hat{z}$): $\hat{z}^{(3)} = \frac{[-5, 5, 5, -5] - 0}{5} = [-1, 1, 1, -1]$.


Final Answer

By stacking our normalized rows back together, the final normalized 3×4 matrix $\hat{Z}$ is:

$$\hat{Z} = \begin{bmatrix} -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & -1 \end{bmatrix}$$

(Notice how, despite starting with completely different ranges of numbers in the original matrix, the Batch Normalization successfully squashed every single row into the exact same standardized scale!)
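The whole row-by-row calculation collapses to a few lines of numpy. This is a sketch of the normalization step only (with $\epsilon = 0$, as the problem allows):

```python
import numpy as np

Z = np.array([[12, 14, 14, 12],
              [ 0, 10, 10,  0],
              [-5,  5,  5, -5]], dtype=float)

# Mean and variance per unit (row), i.e. across the batch axis (columns)
mu = Z.mean(axis=1, keepdims=True)   # [[13], [5], [0]]
var = Z.var(axis=1, keepdims=True)   # [[1], [25], [25]]
Z_hat = (Z - mu) / np.sqrt(var)      # epsilon = 0
print(Z_hat)                         # every row becomes [-1, 1, 1, -1]
```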

The Vector Approach (The Professor's Formula)

In your professor's formula, the $z_i$ terms are column vectors. In your specific problem, each column of the matrix is one example: $z_1 = [12, 0, -5]^\top$, $z_2 = [14, 10, 5]^\top$, $z_3 = [14, 10, 5]^\top$, $z_4 = [12, 0, -5]^\top$.

Look closely at the professor's formula for the mean: $\mu = \frac{1}{m}\sum_{i=1}^{m} z_i$.
Because you are adding vectors together, you must follow the rules of linear algebra, which dictate that vector addition is performed element-wise (row by row).

Let's plug the vectors into the professor's formula:
$$\mu = \frac{1}{4}\left(\begin{bmatrix} 12 \\ 0 \\ -5 \end{bmatrix} + \begin{bmatrix} 14 \\ 10 \\ 5 \end{bmatrix} + \begin{bmatrix} 14 \\ 10 \\ 5 \end{bmatrix} + \begin{bmatrix} 12 \\ 0 \\ -5 \end{bmatrix}\right)$$

When you add those columns together, you add the top row together, the middle row together, and the bottom row together:
$$\mu = \frac{1}{4}\begin{bmatrix} 12+14+14+12 \\ 0+10+10+0 \\ -5+5+5-5 \end{bmatrix} = \frac{1}{4}\begin{bmatrix} 52 \\ 20 \\ 0 \end{bmatrix} = \begin{bmatrix} 13 \\ 5 \\ 0 \end{bmatrix}$$

Notice what just happened! The resulting mean $\mu$ is a vector containing exactly the three numbers we found when we calculated it row-by-row.

Why did I explain it row-by-row?

I broke it down row-by-row because your professor's notes explicitly state the golden rule of Batch Normalization: "compute the empirical mean and variance independently for each dimension".

If you look at later slides, the professor actually expands the notation to show this explicitly by using two indices: $\hat{x}_{i,j} = \frac{x_{i,j} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}$. In this expanded notation, $i$ indexes the example in the mini-batch (the column) and $j$ indexes the dimension/unit (the row), so each dimension $j$ gets its own $\mu_j$ and $\sigma_j^2$.

Summary

The Vectorized Approach: Computation

Instead of calculating row-by-row, we can use the formal vector definitions of Batch Normalization, treating each example in the mini-batch as a full column vector. Operations like addition, subtraction, squaring, and division are performed element-wise.

Step 1: Define the Input Vectors ($z_i$) Separate the input matrix into $m=4$ column vectors, where each vector represents one example in the mini-batch: $$z_1 = \begin{bmatrix} 12 \\ 0 \\ -5 \end{bmatrix}, \quad z_2 = \begin{bmatrix} 14 \\ 10 \\ 5 \end{bmatrix}, \quad z_3 = \begin{bmatrix} 14 \\ 10 \\ 5 \end{bmatrix}, \quad z_4 = \begin{bmatrix} 12 \\ 0 \\ -5 \end{bmatrix}$$

Step 2: Calculate the Mean Vector ($\mu$) Add all column vectors together and divide by $m$: $$\mu = \frac{1}{4} (z_1 + z_2 + z_3 + z_4) = \frac{1}{4} \begin{bmatrix} 12+14+14+12 \\ 0+10+10+0 \\ -5+5+5-5 \end{bmatrix} = \begin{bmatrix} 13 \\ 5 \\ 0 \end{bmatrix}$$

Step 3: Calculate the Variance Vector ($\sigma^2$) Subtract the mean vector $\mu$ from each input vector, square the resulting elements, and average them: $$\sigma^2 = \frac{1}{4} \left( (z_1 - \mu)^2 + (z_2 - \mu)^2 + (z_3 - \mu)^2 + (z_4 - \mu)^2 \right)$$ $$\sigma^2 = \frac{1}{4} \left( \begin{bmatrix} -1 \\ -5 \\ -5 \end{bmatrix}^2 + \begin{bmatrix} 1 \\ 5 \\ 5 \end{bmatrix}^2 + \begin{bmatrix} 1 \\ 5 \\ 5 \end{bmatrix}^2 + \begin{bmatrix} -1 \\ -5 \\ -5 \end{bmatrix}^2 \right) = \frac{1}{4} \left( 4 \begin{bmatrix} 1 \\ 25 \\ 25 \end{bmatrix} \right) = \begin{bmatrix} 1 \\ 25 \\ 25 \end{bmatrix}$$

Step 4: Normalize Each Vector ($\hat{z}_i$) Apply the normalization formula $\hat{z}_i = \frac{z_i - \mu}{\sqrt{\sigma^2}}$ (assuming $\epsilon=0$) using element-wise division. The standard deviation vector is $\sqrt{\sigma^2} = [1, 5, 5]^\top$.

Step 5: Reconstruct the Final Matrix ($\hat{Z}$) Stack the resulting column vectors back together to form the final 3×4 normalized matrix: $$\hat{Z} = [\hat{z}_1, \hat{z}_2, \hat{z}_3, \hat{z}_4] = \begin{bmatrix} -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & -1 \end{bmatrix}$$

Part 2

Continue with the above setting. Suppose $\gamma = (1, 1, 1)$ and $\beta = (0, -10, 10)$. What are $\tilde{z}_i$? Please express your answer in a 3×4 matrix.


Now that you have successfully normalized the data (giving it a mean of 0 and a variance of 1), you are ready for the second half of the Batch Normalization process: Scaling and Shifting.

Here is the step-by-step explanation of why we do this and how to solve the math.

1. The Concept: Why Scale and Shift?

In the previous question, we squashed the output of every single unit so that it perfectly centered around 0 with a spread of 1. However, forcing every single layer to have the exact same rigid scale can sometimes be too restrictive and actually hurt the neural network's ability to learn complex patterns.

To fix this, the creators of Batch Normalization added a clever trick: after we normalize the data, we give the network the power to scale and shift the data into whatever range it actually needs.

If the network decides that the strict 0-mean and 1-variance was a bad idea, it can use γ and β to completely reverse the normalization and recover the original raw data.

2. Breaking Down the Parameters

Just like the mean and variance, the scaling (γ) and shifting (β) are applied independently to each dimension (unit/row).

The problem states that $\gamma = (1, 1, 1)$ and $\beta = (0, -10, 10)$. Because there are 3 units (rows) in our network, these vectors contain 3 numbers. Here is how they match up to our rows: $\gamma_1 = 1, \beta_1 = 0$ apply to Row 1; $\gamma_2 = 1, \beta_2 = -10$ apply to Row 2; and $\gamma_3 = 1, \beta_3 = 10$ apply to Row 3.

3. Solving the Math Step-by-Step

Let's take the normalized matrix $\hat{Z}$ we calculated in the previous question:

$$\hat{Z} = \begin{bmatrix} -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & -1 \end{bmatrix}$$

We will now apply the formula $\tilde{z}_i = \gamma \hat{z}_i + \beta$ to each row element-by-element.

**Unit 1 (Row 1):**
- Original normalized row: $[-1, 1, 1, -1]$
- Multiply by $\gamma_1 = 1$: $[-1, 1, 1, -1] \times 1 = [-1, 1, 1, -1]$
- Add $\beta_1 = 0$: $[-1, 1, 1, -1] + 0 = \mathbf{[-1, 1, 1, -1]}$

**Unit 2 (Row 2):**
- Original normalized row: $[-1, 1, 1, -1]$
- Multiply by $\gamma_2 = 1$: $[-1, 1, 1, -1] \times 1 = [-1, 1, 1, -1]$
- Add $\beta_2 = -10$: $[-1, 1, 1, -1] - 10 = \mathbf{[-11, -9, -9, -11]}$

**Unit 3 (Row 3):**
- Original normalized row: $[-1, 1, 1, -1]$
- Multiply by $\gamma_3 = 1$: $[-1, 1, 1, -1] \times 1 = [-1, 1, 1, -1]$
- Add $\beta_3 = 10$: $[-1, 1, 1, -1] + 10 = \mathbf{[9, 11, 11, 9]}$

---

### Final Answer

By stacking our newly scaled and shifted rows back together, the final output matrix $\tilde{Z}$ for the Batch Normalization layer is:

$$\tilde{Z} = \begin{bmatrix} -1 & 1 & 1 & -1 \\ -11 & -9 & -9 & -11 \\ 9 & 11 & 11 & 9 \end{bmatrix}$$
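The scale-and-shift step is a one-line broadcast in numpy. A minimal sketch, continuing from the normalized matrix of Part 1:

```python
import numpy as np

Z_hat = np.array([[-1, 1, 1, -1],
                  [-1, 1, 1, -1],
                  [-1, 1, 1, -1]], dtype=float)
gamma = np.array([[1], [1], [1]])    # one scale per unit (row)
beta = np.array([[0], [-10], [10]])  # one shift per unit (row)

Z_tilde = gamma * Z_hat + beta       # broadcasts row-wise across the batch
print(Z_tilde)  # rows: [-1 1 1 -1], [-11 -9 -9 -11], [9 11 11 9]
```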

## Part 3

Describe the differences of the computations required for batch normalization during training and testing.

---

To solve this, we need to understand a fundamental rule of Machine Learning: **Your model's prediction for a single test image should never change depending on what other images happen to be in the same batch.**

Here is the step-by-step breakdown from scratch of exactly how and why the computations change between training and testing.

### 1. Computations During Training (The "Mini-Batch" Phase)

During the training phase, you are passing data through the network in chunks called **mini-batches** (e.g., 32 images at a time). Here is exactly what the computer calculates for the Batch Normalization layer during training:

1. **Calculate the Batch Mean ($\mu_\mathcal{B}$):** It calculates the average of the data _specifically for the current mini-batch_.
2. **Calculate the Batch Variance ($\sigma_\mathcal{B}^2$):** It calculates the spread of the data _specifically for the current mini-batch_.
3. **Normalize:** It uses that specific mini-batch mean and variance to normalize the data (making it mean 0 and variance 1).
4. **Scale and Shift:** It applies the learnable parameters $\gamma$ and $\beta$ to scale and shift the data.
5. **Update Parameters:** The network uses backpropagation to learn and update the best values for $\gamma$ and $\beta$.

**The Secret Extra Step:** While it is doing all of this, the network is also quietly maintaining a **running average** (a global average) of every mean and variance it sees across the different mini-batches. It saves this global average in its memory for later.

### 2. The Problem: Why can't we do this during Testing?

Imagine you have finished training your network and you deploy it to a hospital to diagnose X-Rays. A doctor uploads a single X-Ray image.

- If you try to run Batch Normalization the same way you did in training, the network will try to calculate the mean and variance of the "batch".
- But the batch size is just 1! The math will completely break.

Furthermore, even if the doctor uploaded 10 X-Rays at once, you **cannot** use the mean of those 10 images. If you did, the diagnosis for Patient A would physically change depending on whether Patient B's X-Ray was in the same batch. Your professor notes that this is a very common mistake.

### 3. Computations During Testing (The "Fixed" Phase)

Because of the problem above, the Batch Normalization layer fundamentally changes how it functions during test time. Here is what it computes during testing:

1. **NO Mean/Variance Calculation:** The network **does not** compute the mean or variance from the incoming test data.
2. **Retrieve the Running Averages:** Instead, it digs into its memory and pulls out the **global running average** for the mean ($\mu$) and variance ($\sigma^2$) that it saved during the training phase.
3. **Normalize:** It normalizes the new test data using those _fixed_ training averages.
4. **Scale and Shift:** It scales and shifts the data using the _fixed_ $\gamma$ and $\beta$ weights that it fully learned during training.

---

### Final Answer

Here is a clear, detailed way you can summarize this for your homework:

**Differences in Computation for Batch Normalization:**

- **During Training:** The layer computes the mean ($\mu_\mathcal{B}$) and variance ($\sigma_\mathcal{B}^2$) directly from the current **mini-batch** of data. It uses these mini-batch statistics to normalize the inputs. During this time, it continuously learns the scaling ($\gamma$) and shifting ($\beta$) parameters via backpropagation. Simultaneously, it maintains and saves a **running average** (global average) of the means and variances seen across all training batches.
- **During Testing (Prediction):** The layer **does not** compute the mean and variance from the test data. Doing so would cause a single sample's prediction to improperly depend on other samples in the test batch. Instead, it uses the **fixed running averages** of the mean and variance that were saved during training. It then applies the fixed, fully learned $\gamma$ and $\beta$ parameters to scale and shift the normalized test data.

### Understanding Unit vs. Mini-batch

This is a very normal point of confusion! The phrases "from the batch" and "for each unit separately" are not mutually exclusive. In fact, they describe two different dimensions of the exact same matrix. Let's bring back your $3 \times 4$ matrix example to visualize exactly how both concepts happen at the same time.

$$\begin{bmatrix} 12 & 14 & 14 & 12 \\ 0 & 10 & 10 & 0 \\ -5 & 5 & 5 & -5 \end{bmatrix}$$

### 1. "For each unit separately" (The Rows)

You are completely correct that we calculate the mean and variance for each unit separately.

- **Row 1** represents **Unit 1**.
- **Row 2** represents **Unit 2**.
- **Row 3** represents **Unit 3**.

We do _not_ calculate a giant average using all 12 numbers in the matrix. We calculate a separate $\mu$ for Row 1, a separate $\mu$ for Row 2, and a separate $\mu$ for Row 3.

### 2. "From the mini-batch" (The Columns)

So, if we are looking at Unit 1 (Row 1), where do the numbers `12, 14, 14, 12` actually come from?

- They come from the **4 examples in your mini-batch** (the columns).
- To get the mean for Unit 1 ($\mu = 13$), you are adding up the output of Unit 1 _across the 4 examples in the current batch_.

This is why we say we are "normalizing based on the batch data." The average of Unit 1 is strictly dependent on which 4 images happen to be in this specific batch. If the network grabs 4 completely different images for the next batch, the numbers in Row 1 will change, and the mean for Unit 1 will change.

### Why this explains the Training vs. Testing difference

During **Training**, we have a guaranteed mini-batch size (e.g., $m=4$). This gives us 4 columns, which provides enough numbers to calculate an average and a variance for each row.

During **Testing**, imagine a user uploads just **one single image** for the network to evaluate. Your matrix is no longer $3 \times 4$. It is now $3 \times 1$:

$$\begin{bmatrix} 10 \\ 5 \\ 2 \end{bmatrix}$$

Now, look at Unit 1 (Row 1). It only has a single number (`10`). You cannot calculate a variance for a single number! The math completely breaks.

Furthermore, even if the user uploaded 4 test images at once, you wouldn't want to calculate the mean across them. If you did, the network's prediction for Image 1 would change depending on whether Image 2 was a dog or a cat.

**The Solution:** Because the testing batch size could be 1, or because we don't want test images interfering with each other, we completely stop calculating the mean and variance across the columns. Instead, for Unit 1, we just plug in the **global running average** that Unit 1 saw across all the thousands of batches it processed during the training phase.

## Part 4

Describe how the batch size during testing affects testing results.

---

This question is actually a direct follow-up to the previous one, and it tests whether you truly understand the "fixed" nature of the testing phase we just talked about. Here is the step-by-step breakdown of how to think about this question and how to formulate your answer.

### 1. The Direct Answer

The short and most important answer is: **The batch size during testing has absolutely no mathematical effect on the testing results (predictions) for any individual sample.**

Whether you pass $1$ image through the network, or a batch of $100$ images at the exact same time, the prediction for Image #1 will be exactly the same.

### 2. The Reason (Why doesn't it matter?)

To understand why, we have to look back at the golden rule of the Testing Phase from your professor's notes: _"At test time BatchNorm layer functions differently: The mean/std are not computed based on the batch. Instead, a single fixed empirical mean of activations during training is used"_.

Think about the math formula for Batch Normalization:

$$\hat{z}_i = \frac{z_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

Because $\mu$, $\sigma^2$, $\gamma$, and $\beta$ are all locked, fixed numbers during testing, the math applied to a single input vector $z_i$ is completely isolated. The network doesn't look at the rest of the batch to calculate anything, so the size of the batch is mathematically irrelevant to the output!

3. The "What If" Scenario (Why must it be this way?)

To solidify this for your understanding, imagine what would happen if the testing batch size did affect the results.

Let's go back to our hospital X-Ray example: if the network computed its statistics from the test batch, Patient A's diagnosis would change depending on which other patients' X-Rays happened to be uploaded alongside it.

This violates a fundamental rule of machine learning: a model's prediction for a specific data point must depend only on that data point. By using fixed training statistics, we guarantee that the testing batch size has zero influence on the prediction.


Final Answer

Here is a clear and professional way you can summarize this for your assignment:

"The batch size during testing does not affect the testing results or predictions for any individual sample. During the testing phase, the Batch Normalization layer no longer computes the mean and variance from the incoming batch data. Instead, it uses the fixed running averages of the mean and variance that were pre-computed and saved during the training phase. Because the normalization parameters ($\mu$, $\sigma^2$, $\gamma$, $\beta$) are completely fixed during testing, each sample in a test batch is processed completely independently. Therefore, whether a sample is passed through the network in a batch size of 1 or a batch size of 100, the mathematical output for that specific sample will remain exactly the same."
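The train/test difference and the batch-size independence can both be demonstrated with a toy batch norm in numpy. This is a minimal sketch, not the real `torch.nn.BatchNorm1d`; the class name, the momentum-style running average, and the momentum value are my own assumptions:

```python
import numpy as np

class ToyBatchNorm:
    """Toy per-unit batch norm: batch statistics in training, fixed running averages in testing."""
    def __init__(self, num_units, momentum=0.1):
        self.gamma = np.ones((num_units, 1))    # learned scale (fixed after training)
        self.beta = np.zeros((num_units, 1))    # learned shift (fixed after training)
        self.run_mu = np.zeros((num_units, 1))  # running average of batch means
        self.run_var = np.ones((num_units, 1))  # running average of batch variances
        self.momentum = momentum

    def forward(self, Z, training):
        if training:
            mu = Z.mean(axis=1, keepdims=True)   # statistics of THIS mini-batch
            var = Z.var(axis=1, keepdims=True)
            self.run_mu = (1 - self.momentum) * self.run_mu + self.momentum * mu
            self.run_var = (1 - self.momentum) * self.run_var + self.momentum * var
        else:
            mu, var = self.run_mu, self.run_var  # fixed statistics saved from training
        return self.gamma * (Z - mu) / np.sqrt(var + 1e-5) + self.beta

bn = ToyBatchNorm(3)
bn.forward(np.random.randn(3, 4), training=True)   # training step updates running stats

x = np.random.randn(3, 1)                          # one test sample
alone = bn.forward(x, training=False)
in_batch = bn.forward(np.hstack([x, np.random.randn(3, 9)]), training=False)[:, :1]
print(np.allclose(alone, in_batch))  # → True: test batch size does not change the output
```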

4. LeNet for Image Recognition

In this coding assignment, you will need to complete the implementation of LeNet (LeCun Network) using Pytorch and apply the LeNet to the image recognition task on Cifar-10 (10-class classification). The Cifar-10 dataset is available here (https://www.cs.toronto.edu/~kriz/cifar.html). In addition, you will need to install the python packages “tqdm” and “pytorch”. The installation guide for PyTorch is in “readme.txt”. Please read carefully and follow the instructions. You are expected to implement your solution based on the given code. The only file you need to modify is the “solution.py” file. You can test your solution by running the “main.py” file.

Part 1

Download and extract the Cifar10 Dataset from the link above. Put the data folder “cifar-10-batches-py” in the same directory as “code”. Read the instructions carefully and then complete the function load_data().


1. Understanding the CIFAR-10 Data Format

Before we write the code, we need to know how the CIFAR-10 creators saved the dataset. According to the dataset documentation in your sources:

2. The Reshape Math (Why 3072 -> 3 x 32 x 32?)

A standard color image has a Height, a Width, and 3 Color Channels (Red, Green, Blue).

The creators of the dataset "flattened" these 3D images into a single 1D row of 3072 numbers to make them easier to save.

To feed this data into a Convolutional Neural Network (LeNet), we cannot use a flat row. We must restore it to its 3D box shape: [Channels (3), Height (32), Width (32)].
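A quick sketch of what that reshape recovers, using a synthetic flat row (np.arange stands in for real pixel data; the stored order, per the CIFAR-10 documentation, is 1024 red values, then 1024 green, then 1024 blue):

```python
import numpy as np

# A fake flattened CIFAR-10 row: 3072 values in the stored order
# (first 1024 = red channel, next 1024 = green, last 1024 = blue).
flat = np.arange(3072)

img = flat.reshape(3, 32, 32)   # restore [Channels, Height, Width]

assert img.shape == (3, 32, 32)
assert img[0, 0, 0] == flat[0]      # red channel starts at index 0
assert img[1, 0, 0] == flat[1024]   # green channel starts at index 1024
assert img[2, 0, 0] == flat[2048]   # blue channel starts at index 2048
```

Because NumPy reshapes in row-major order, each contiguous block of 1024 values becomes one 32 × 32 channel, which is exactly the layout the convolutional layers expect.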


3. Step-by-Step Implementation

Here is exactly how we will build your load_data function:

  1. Unpickling Helper: We will write a tiny function inside load_data to open the pickled files and read the dictionaries.
  2. Load Training Data: We will loop through numbers 1 to 5 to open data_batch_1 through data_batch_5, extracting the pixels and labels from each and appending them to a list.
  3. Combine Arrays: We will use numpy to stitch those 5 separate batches together into one giant array of 50,000 images and 50,000 labels.
  4. Load Testing Data: We will open test_batch to get our 10,000 testing images and labels.
  5. Reshape & Type Cast: We will use the numpy .reshape() command to convert the flat 3072 arrays into the (3, 32, 32) format. We will also force the data types to match your assignment's requirements (uint8 for images, int64 for labels).

The Code

Here is the complete code you can copy and paste into your solution.py file inside the load_data(data_dir): block. Make sure you import os, pickle, and numpy at the very top of your solution.py file!

import os
import pickle
import numpy as np

def load_data(data_dir):
    '''
    To load the Cifar-10 Dataset from files and reshape the
    images arrays from shape [N, 3072] to shape [N, 3, 32, 32].
    ...
    '''

    ### YOUR CODE HERE

    # 1. Helper function to unpickle the files
    def unpickle(file):
        with open(file, 'rb') as fo:
            # encoding='bytes' ensures Python 3 reads the byte strings properly
            batch_dict = pickle.load(fo, encoding='bytes')
        return batch_dict

    # 2. Initialize empty lists to hold our batches of data
    x_train_list = []
    y_train_list = []

    # 3. Loop through training batches 1 to 5
    for i in range(1, 6):
        # Construct the file path (e.g., '../cifar-10-batches-py/data_batch_1')
        filename = os.path.join(data_dir, f'data_batch_{i}')

        # Unpickle the dictionary
        batch_dict = unpickle(filename)

        # The dictionary keys are stored as byte strings (b'data' and b'labels')
        x_train_list.append(batch_dict[b'data'])
        y_train_list += batch_dict[b'labels']  # use += to combine the lists

    # 4. Combine the 5 lists into single numpy arrays
    # vstack stacks the 5 (10000, 3072) arrays into one (50000, 3072) array
    x_train = np.vstack(x_train_list)
    # Convert the combined labels list into a numpy array of type int64
    y_train = np.array(y_train_list, dtype=np.int64)

    # 5. Load the test batch
    test_filename = os.path.join(data_dir, 'test_batch')
    test_dict = unpickle(test_filename)

    x_test = test_dict[b'data']
    y_test = np.array(test_dict[b'labels'], dtype=np.int64)

    # 6. Reshape the image data!
    # -1 tells numpy to automatically figure out the first dimension (50000 or 10000)
    # 3, 32, 32 separates the flat 3072 row into 3 channels of 32x32 grids.
    # .astype(np.uint8) ensures the pixel values are standard 8-bit unsigned integers.
    x_train = x_train.reshape(-1, 3, 32, 32).astype(np.uint8)
    x_test = x_test.reshape(-1, 3, 32, 32).astype(np.uint8)

    ### END YOUR CODE

    return x_train, y_train, x_test, y_test

A quick note on why b'data' instead of 'data'

Because the CIFAR-10 dataset is a bit older and was originally pickled in Python 2, loading it in modern Python 3 causes the string keys in the dictionary to be read as "byte strings". In Python, you write a byte string by putting a b in front of the quotes. So, we ask the dictionary for b'data' instead of 'data'.
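A tiny self-contained illustration of the byte-string keys (the dict here just mimics what pickle.load with encoding='bytes' returns; the values are made up):

```python
# Simulate the kind of dictionary CIFAR-10's Python-2-era pickles produce:
batch_dict = {b'data': [1, 2, 3], b'labels': [0, 1, 0]}

assert b'data' in batch_dict      # the byte-string key works...
assert 'data' not in batch_dict   # ...but the plain string key does not
```

In Python 3, b'data' and 'data' are different objects (bytes vs. str), so they are different dictionary keys.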

Let me know when you get this successfully running in your main.py and are ready to tackle the LeNet model itself!

Running the code:

> python3 main.py
Loading and preprocessing...
/Users/macc/TAMU/Spring_26/CSCE-421/HW3/code/solution.py:37: VisibleDeprecationWarning: dtype(): align should be passed as Python or NumPy boolean but got `align=0`. Did you mean to pass a tuple to create a subarray type? (Deprecated NumPy 2.4)
  batch_dict = pickle.load(fo, encoding='bytes')

Why is this happening? The message you are seeing is a VisibleDeprecationWarning, which is just a heads-up from the system, not an actual error that stops your code. The CIFAR-10 dataset files were originally created and saved (using Python's cPickle module) many years ago using a much older version of the NumPy library. You are currently using a very modern version of NumPy (version 2.4).

When your modern NumPy reads the old pickled file using pickle.load(fo, encoding='bytes'), it notices that the arrays were saved using an outdated formatting parameter (align=0). It prints this warning to let you know the format is deprecated, but it is still perfectly capable of reading the data. The data arrays loaded into your computer's memory are exactly what they need to be.

You can safely ignore this warning and move right along! Let me know what the next part of your assignment asks you to do to start building the LeNet model.

Part 2

Complete the function preprocessing(), you need to implement two ways of preprocessing the data: (1) rescaling: rescales the image pixels from range 0-255 to range 0-1; (2) normalization: centralize and rescale each image using its mean and variance. There is a parameter normalize in the preprocessing() to control whether to rescale or normalize the images.


Data preprocessing is a crucial step because neural networks learn much faster and more stably when the input numbers are small and on a standardized scale.

1. The Goal of the Function

The docstring tells us we have two arrays: train_images and test_images. Currently, they contain pixel values ranging from 0 to 255, and their data type is uint8 (8-bit integers). The docstring also specifies that our function must return these arrays as float64 (decimals). Therefore, the very first thing we must do is convert the data type of both arrays.

2. Method 1: Rescaling (normalize=False)

If the user passes normalize=False, we just need to squash the 0 to 255 range down to a 0 to 1 range. Because the maximum pixel value is 255, we can achieve this perfectly by simply dividing every single number in the arrays by 255.0.
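For instance (a throwaway sanity check, not part of the assignment code):

```python
import numpy as np

pixels = np.array([0, 51, 255], dtype=np.uint8)

scaled = pixels.astype(np.float64) / 255.0   # cast first, then divide

assert scaled.min() == 0.0 and scaled.max() == 1.0
assert np.isclose(scaled[1], 0.2)            # 51 / 255 = 0.2
```

Casting to float before dividing matters: dividing uint8 arrays in place would truncate everything to 0.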

3. Method 2: Normalization (normalize=True)

If the user passes normalize=True, we have to do something a bit more mathematically complex. We need to centralize the data (give it a mean of 0) and rescale it using variance (give it a spread of 1).

To do this, we must remember a critical rule from your notes:

The Python Implementation

Here is the code you can copy and paste into your solution.py file to complete the preprocess function:

import numpy as np

def preprocess(train_images, test_images, normalize=False):
    '''
    To preprocess the data by
        (1).Rescaling the pixels from integers in [0,255) to
            floats in [0,1), or
        (2).Normalizing each image using its mean and variance.
    ...
    '''
    ### YOUR CODE HERE

    # 1. Cast the arrays to float64 as required by the docstring's Returns section
    train_images = train_images.astype(np.float64)
    test_images = test_images.astype(np.float64)

    if not normalize:
        # (1) Rescaling: Squeeze the range down to [0, 1]
        train_images = train_images / 255.0
        test_images = test_images / 255.0

    else:
        # (2) Normalization: Centralize (subtract mean) and rescale (divide by std)

        # Calculate the mean and standard deviation using ONLY the training data.
        # axis=(0, 2, 3) calculates these statistics per color channel (R, G, B)
        # across all images (axis 0), heights (axis 2), and widths (axis 3).
        # keepdims=True ensures the shape aligns perfectly for subtraction/division later.
        train_mean = np.mean(train_images, axis=(0, 2, 3), keepdims=True)
        train_std = np.std(train_images, axis=(0, 2, 3), keepdims=True)

        # Prevent division by zero just in case a channel is completely blank
        train_std = np.maximum(train_std, 1e-7)

        # Apply the EXACT SAME training statistics to normalize BOTH datasets
        train_images = (train_images - train_mean) / train_std
        test_images = (test_images - train_mean) / train_std

    ### END YOUR CODE

    return train_images, test_images

A Quick Note on the Math

You might notice the prompt asks to rescale using "variance", but the code uses np.std (Standard Deviation). In statistics and deep learning, the standard formula to normalize data to a unit variance is to divide by the standard deviation (which is simply the square root of variance). Dividing directly by the raw variance is mathematically incorrect for this goal, so np.std is the proper NumPy tool to use here!
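A quick numerical check of that claim (standalone, not part of the solution):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
centered = x - x.mean()

by_std = centered / x.std()   # correct: yields unit variance
by_var = centered / x.var()   # incorrect for this goal

assert np.isclose(by_std.var(), 1.0)       # variance becomes exactly 1
assert not np.isclose(by_var.var(), 1.0)   # dividing by variance does not
```

Here x has variance 5, so dividing by the variance shrinks the spread to 5/25 = 0.2 instead of 1.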

Let me know if the axis=(0, 2, 3) part makes sense or if you are ready to move on to building the actual LeNet model!

Part 3

Complete the class LeNet().
In particular, you need to complete functions __init__() and forward() in the class. The paper for LeNet can be found here (https://ieeexplore.ieee.org/document/726791). The network architecture is shown in the figure below (LeNet.png).

The subsampling is implemented using max pooling, and the kernel size for all the convolutional layers is 5 × 5. The sequential layers are:

Inputs → Convolution (6 out channels) → BN → ReLU → Max Pooling → Convolution (16 out channels) → BN → ReLU → Max Pooling → Reshape to vector → Fully-connected (120 out units) → BN → ReLU → Fully-connected (84 out units) → BN → ReLU → Dropout → Outputs (n_classes out units)

For this part, you are only allowed to use the APIs in torch.nn. Please refer to the PyTorch API documents below for the usage of those APIs before you use them: https://pytorch.org/docs/stable/nn.html


Building your first Neural Network from scratch is a major milestone! To implement the LeNet class, we need to map your professor's exact layer sequence to PyTorch functions inside the __init__ method, and then pass the data through them in the forward method.

Before we write the code, we must solve one critical mathematical puzzle: What is the size of the vector when we "Reshape to vector"?

1. The Math: Tracking the Tensor Shape

Just like we did in Question 1 of your homework, we need to track the size of our images as they pass through the layers using the formula: Output = (N − F) / Stride + 1 (there is no padding here). Starting from the 32 × 32 input: conv1 (5 × 5) gives (32 − 5)/1 + 1 = 28, max pooling (2 × 2, stride 2) halves it to 14, conv2 (5 × 5) gives (14 − 5)/1 + 1 = 10, and the second pooling halves it to 5.

The Reshape Step: At this point, we have 16 feature maps, each 5×5 in size. To pass this into a Fully-Connected (Linear) layer, we must flatten it into a single 1D vector.
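The shape bookkeeping can be double-checked with a few lines of Python (conv_out and pool_out are hypothetical helper names, not part of the assignment skeleton):

```python
def conv_out(n, f, stride=1, pad=0):
    """Spatial output size of a convolution: (N - F + 2P) / S + 1."""
    return (n + 2 * pad - f) // stride + 1

def pool_out(n, k=2, stride=2):
    """Spatial output size of max pooling."""
    return (n - k) // stride + 1

n = 32                      # CIFAR-10 input is 32 x 32
n = conv_out(n, 5)          # conv1 (5x5): 32 -> 28
n = pool_out(n)             # pool1 (2x2): 28 -> 14
n = conv_out(n, 5)          # conv2 (5x5): 14 -> 10
n = pool_out(n)             # pool2 (2x2): 10 -> 5

flat_features = 16 * n * n  # 16 channels x 5 x 5
assert flat_features == 400
```

This 400 is exactly the in_features value the first fully-connected layer needs.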


2. Implementing __init__() (Defining the Layers)

In the __init__ function, we just initialize the tools (layers) we are going to use.

3. Implementing forward(x) (Passing the Data)

In the forward function, we take the input image x and pass it sequentially through the tools we defined in __init__.

To do the "Reshape to vector" step, we will use the PyTorch command x = x.view(x.size(0), -1).


The Final Code

Here is the fully completed solution.py class implementation. You can replace your LeNet skeleton with this:

import torch
import torch.nn as nn

class LeNet(nn.Module):
    '''
    Build the LeCun network according to the architecture in the homework part 4(c)
    '''
    def __init__(self, n_classes=None):
        super(LeNet, self).__init__()

        ### YOUR CODE HERE

        # --- BLOCK 1: Convolutional Layers ---
        # 1. Convolution (3 in channels for RGB -> 6 out channels, 5x5 kernel)
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)
        self.bn1 = nn.BatchNorm2d(6)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)

        # 2. Convolution (6 in channels -> 16 out channels, 5x5 kernel)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5)
        self.bn2 = nn.BatchNorm2d(16)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)

        # --- BLOCK 2: Fully-Connected Layers ---
        # 3. Fully-connected (16*5*5 = 400 inputs -> 120 out units)
        self.fc1 = nn.Linear(in_features=16*5*5, out_features=120)
        self.bn3 = nn.BatchNorm1d(120)
        self.relu3 = nn.ReLU()

        # 4. Fully-connected (120 inputs -> 84 out units)
        self.fc2 = nn.Linear(in_features=120, out_features=84)
        self.bn4 = nn.BatchNorm1d(84)
        self.relu4 = nn.ReLU()
        self.dropout = nn.Dropout(p=0.5) # 0.5 is the standard dropout probability

        # 5. Outputs (84 inputs -> n_classes out units)
        self.fc3 = nn.Linear(in_features=84, out_features=n_classes)

        ### END CODE HERE

    def forward(self, x):
        '''
        Run forward pass of the model defined in the above __init__() function
        '''
        ### YOUR CODE HERE

        # Pass through Block 1
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu1(x)
        x = self.pool1(x)

        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu2(x)
        x = self.pool2(x)

        # Reshape to vector (Flatten the 16x5x5 feature maps into 400-length vectors)
        # x.size(0) keeps the batch dimension separate.
        x = x.view(x.size(0), -1)

        # Pass through Block 2
        x = self.fc1(x)
        x = self.bn3(x)
        x = self.relu3(x)

        x = self.fc2(x)
        x = self.bn4(x)
        x = self.relu4(x)
        x = self.dropout(x)

        # Final Output (Logits)
        logits = self.fc3(x)

        ### END CODE HERE

        return logits

One last detail: Notice that the very last layer fc3 just returns the raw, un-normalized values (called "logits"). It does not apply the Softmax activation function. Why? Because in PyTorch, the standard loss function for multi-class classification (CrossEntropyLoss) actually calculates the Softmax math internally for better numerical stability! Therefore, we just return the raw logits.
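To see what CrossEntropyLoss folds together, here is the same math sketched in plain NumPy for a single sample (a conceptual illustration, not PyTorch internals; the logit values are made up):

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])   # raw fc3 outputs for one sample
target = 0                            # true class index

# CrossEntropyLoss = log-softmax + negative log-likelihood, fused.
# Subtracting the max first is the numerical-stability trick.
shifted = logits - logits.max()
log_softmax = shifted - np.log(np.exp(shifted).sum())
loss = -log_softmax[target]

# Equivalent "naive" version: softmax first, then -log of the target prob.
probs = np.exp(logits) / np.exp(logits).sum()
assert np.isclose(loss, -np.log(probs[target]))
```

Both paths give the same loss, but the fused, shifted version avoids overflowing np.exp on large logits, which is why the loss function wants raw logits rather than pre-softmaxed probabilities.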

Part 4

Try to read and understand the class LeNet_Cifar10(). Run the main.py to train and test the model. You need to train two models: one using normalization preprocessing and the default LeNet architecture as defined in part (c), and another using rescaling preprocessing and LeNet without Dropout and Batch Normalization. Compare the results of the two models and analyze why there is an improvement in (or why it harms) the performance. Include your training log and test accuracy, as well as a short analysis of the results in your report. The training and testing may take 10-30 minutes to finish on a CPU (~20 epochs).


Congratulations on making it to the final task! You have built the components, and now it is time to put them all together, run the training loop, and see the deep learning magic happen.

I will first explain what the LeNet_Cifar10 class is doing so you understand the code, then guide you on how to run both models, and finally provide the detailed theoretical analysis you need for your report based on your professor's notes.

1. Understanding the LeNet_Cifar10 Class

This class is a "wrapper" that manages the actual training process. Here is what it does step-by-step:


2. How to Run Model 1 (The Default Model)

This is your "good" model with all the advanced deep learning tricks.

  1. Open main.py and make sure the preprocess line is set to use Normalization: x_train, x_test = preprocess(x_train, x_test, normalize=True)
  2. Make sure your LeNet class in solution.py is exactly as we wrote it earlier (with all the BatchNorm and Dropout layers).
  3. Run python3 main.py.
  4. Action Required: Copy the output printed in your terminal (the Epoch losses and Validation Accuracies) and the final Test Accuracy. Save this in your report!

3. How to Run Model 2 (The Stripped-Down Model)

This is the "barebones" model that relies purely on convolutions and linear layers without the modern tricks.

  1. Open main.py and change the preprocess line to use Rescaling instead: x_train, x_test = preprocess(x_train, x_test, normalize=False)

  2. Open solution.py and comment out the Batch Normalization and Dropout layers in your LeNet class. Your modified forward function should look like this:

    def forward(self, x):
        # Block 1
        x = self.conv1(x)
        # x = self.bn1(x)  <-- COMMENTED OUT
        x = self.relu1(x)
        x = self.pool1(x)
    
        x = self.conv2(x)
        # x = self.bn2(x)  <-- COMMENTED OUT
        x = self.relu2(x)
        x = self.pool2(x)
    
        x = x.view(x.size(0), -1)
    
        # Block 2
        x = self.fc1(x)
        # x = self.bn3(x)  <-- COMMENTED OUT
        x = self.relu3(x)
    
        x = self.fc2(x)
        # x = self.bn4(x)  <-- COMMENTED OUT
        x = self.relu4(x)
        # x = self.dropout(x) <-- COMMENTED OUT
    
        logits = self.fc3(x)
        return logits
    
  3. Run python3 main.py again.

  4. Action Required: Save the new training log and final test accuracy for your report.


4. Analysis for your Report (Why Model 1 performs better)

Once your code finishes running, you will notice that Model 1 converges faster, achieves higher validation accuracy, and has a much better test accuracy than Model 2.

Here is the detailed analysis you can adapt for your report to explain exactly why this happens, drawing directly from your course concepts:

1. The Impact of Data Normalization vs. Rescaling In Model 2, the data was only rescaled to a range of 0 to 1. In Model 1, the data was fully normalized using the training set's mean and variance. As noted in the course, un-normalized inputs cause large features to dominate the weight updates, while small features get drowned out. By forcing the input data to have a zero-mean and unit-variance, Model 1 ensures that all weights update at a "similar pace", preventing the gradient descent from oscillating wildly and allowing the network to learn stable patterns right from the first layer.

2. The Impact of Batch Normalization Model 2 lacks Batch Normalization. As data passes through multiple layers of convolutions and ReLUs, the scale of the activations can shift dramatically, making it very hard for the optimizer to find a good path down the error surface. By adding Batch Normalization in Model 1, we force the activations at every layer to maintain a zero-mean and unit-variance. This "improves gradient flow through the network," reduces the model's dependence on lucky weight initialization, and allows the Adam optimizer to take confident, stable steps. Without it, Model 2 likely struggled to optimize its weights efficiently.

3. The Impact of Dropout (Combating Overfitting) Model 2 has no regularization, meaning it is highly susceptible to overfitting. Because the network has thousands of parameters, it likely started to memorize the training data (fitting the "stochastic and deterministic noise") rather than learning generalizable features. Model 1 utilizes Dropout (with a probability of 0.5) on its fully connected layer. By randomly turning off 50% of the neurons during every training pass, Dropout prevents neurons from "co-adapting" and forces the network to learn robust, redundant representations of the images. This acts as a massive ensemble of different networks, heavily regularizing the model and ensuring that the high training accuracy actually translates into high test accuracy.
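The mechanism is easy to sketch: below is a minimal NumPy-only version of inverted dropout (a conceptual sketch, not the nn.Dropout source):

```python
import numpy as np

def dropout_train(x, p=0.5, rng=np.random.default_rng(0)):
    """Inverted dropout: zero each unit with prob p, scale survivors by 1/(1-p)."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.ones(10_000)
out = dropout_train(x)

# Training pass: roughly half the units are silenced...
assert np.isclose((out == 0).mean(), 0.5, atol=0.05)

# ...but the 1/(1-p) scaling keeps the expected activation unchanged,
# so at test time dropout can simply become the identity function.
assert np.isclose(out.mean(), 1.0, atol=0.05)
```

The 1/(1-p) rescaling is what lets model.eval() turn dropout off entirely without changing the expected magnitude of the activations.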

Conclusion Summary for Report: Model 2 suffers from slower/unstable optimization (due to lack of input and batch normalization) and poor generalization (due to lack of Dropout regularization). Model 1 utilizes normalization to create a smooth, efficient gradient descent path, and relies on Dropout to combat overfitting, ultimately yielding a vastly superior out-of-sample (Test) accuracy.


Looking at the results, Model 1 definitely outperformed the stripped-down Model 2, scoring a test accuracy of about 65.9% compared to 61.8%, while also hitting a much lower training loss. This makes a lot of sense when you think about the deep learning tricks we added. First off, using full data normalization gives the inputs a zero-mean and unit-variance, which helps all the weights update at a similar pace instead of having small weights oscillate while large weights dominate the updates. Then we have Batch Normalization, which is a game-changer because it improves the gradient flow through the network and makes the whole optimization process much more stable. That perfectly explains why our training loss dropped so much lower. On top of that, Model 1 used Dropout to randomly shut off neurons during training, preventing them from co-adapting. This basically acts like training a huge ensemble of models, forcing the network to learn redundant, robust features instead of just memorizing the training data. So, between the smoother learning path from the normalization steps and the heavy regularization from Dropout, it's no wonder Model 1 generalized so much better to the unseen test images!

Review (Answers)

Question 1

Output size = ((input size + 2 × padding − kernel size) / stride) + 1 = ((15 + 2 − 3) / 1) + 1 = 15, so the output volume is 15 × 15 × 28.
Number of parameters = 3 × 3 × 3 × 28 + 28 = 756 + 28 = 784 (weights plus one bias per filter).
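Both quantities can be verified mechanically (plain arithmetic, no framework required):

```python
n, f, pad, stride, filters, depth = 15, 3, 1, 1, 28, 3

out = (n + 2 * pad - f) // stride + 1          # spatial output size
params = filters * (f * f * depth) + filters   # weights + one bias per filter

assert out == 15        # output volume: 15 x 15 x 28
assert params == 784    # 756 weights + 28 biases
```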

Question 2

Total trainable parameters = 294,912 + 589,824 + 32,768 = 917,504

Question 3

μ = (1/4)(z1 + z2 + z3 + z4) = (1/4)[12 + 14 + 14 + 12, 0 + 10 + 10 + 0, 5 − 5 + 5 − 5] = [13, 5, 0]
σ² = (1/4)([1, 25, 25] + [1, 25, 25] + [1, 25, 25] + [1, 25, 25]) = [1, 25, 25]  (each deviation zi − μ has entries ±1, ±5, ±5, so each squares elementwise to [1, 25, 25])
ẑi = (zi − μ) / √σ², so every entry of each ẑi is ±1
yi = γ ⊙ ẑi + β  (elementwise scale and shift applied to each normalized vector)

Question 4

You should be able to get a test accuracy of ~65% on Cifar-10