HW3 - Convolutional Networks
Class: CSCE-421
Notes:
Question 1
A single (15 × 15 × 3) image is passed through a convolutional layer with 28 filters, each of size (3 × 3 × 3). The padding size is 1 (1 unit at top, bottom, left, and right) and the stride size is also 1. What is the size of the output feature map volume? What is the number of parameters in this layer (including bias)? Note that, for simplicity, we consider filters of 3 × 3 × 3 as one filter, instead of three filters of size 3 × 3.
What is a Convolution?
To understand how Convolutional Neural Networks (CNNs) work from scratch, we first need to understand how a computer sees an image and how we can extract patterns from it.
To a computer, an image is simply a two-dimensional array of numbers. In a CNN, we use something called a filter (sometimes called a kernel), which is a much smaller box of numbers. The core idea of a convolution is to take this small filter and slide it across the input image. At every step, the network performs an element-wise multiplication between the numbers in the filter and the numbers in that specific patch of the image, and then sums them all up to produce a single number.
These filters are designed to detect important visual features, like vertical or horizontal edges, which combine to form shapes and objects. The numbers inside these filters are the parameters that the neural network actually learns from the data during training.
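The sliding-and-summing operation described above can be sketched in a few lines of numpy. This is a toy example (a made-up 5 × 5 image and a classic vertical-edge kernel, not part of the assignment) showing the element-wise multiply-and-sum at each position:

```python
import numpy as np

def conv2d_single(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum
    the element-wise products at each position."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # multiply, then sum to one number
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
vertical_edge = np.array([[1., 0., -1.],
                          [1., 0., -1.],
                          [1., 0., -1.]])          # classic vertical-edge kernel
result = conv2d_single(image, vertical_edge)
print(result.shape)  # (3, 3): a 5x5 input shrinks under a 3x3 filter with no padding
```

Note how a 5 × 5 input shrinks to 3 × 3 without padding, which is exactly why the padding in Question 1 matters.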
Part 1: What is the size of the output feature map volume?
To find the volume of the output, we need to calculate its spatial dimensions (Height and Width) and its depth (Number of output channels/slices).
You are given a single image of size 15 × 15 × 3.
This means:
- height = 15
- width = 15
- depth = 3
The layer has 28 filters, each of size 3 × 3 × 3.
That means:
- height = 3
- width = 3
- depth = 3
This is important:
- A filter must match the full input depth (Filters always extend to the full depth of the input volume), so since the input depth is 3, each filter must also have depth 3.
- By having the filter extend through the full depth, the network can learn correlations between channels. A filter can decide, for example, that a feature is only present if there is a high value in the Red channel but a low value in the Blue channel.
- When the filter sits on a spot in the input, it performs an element-wise multiplication across every single cell in its 3D volume.
- If your input is 15 × 15 × 3 (RGB) and your filter is 3 × 3, the filter must actually be 3 × 3 × 3. All 27 multiplications (3 × 3 × 3) are performed, and then all of them are summed together into a single number (plus a bias).
- Because these values are summed into one scalar for that specific location, the "depth" is collapsed. This is why a single filter always produces a 2D output map, regardless of how deep the input was.
Spatial vs. Depth Dimensions
- Width and Height (Spatial): We slide the filter across these because we want to find the same feature (like an eye or a bolt) no matter where it appears in the image (Translation Invariance).
- Depth (Channel): We do not slide the filter through the depth because the channels represent different types of information about the same pixels. We want to process all that information simultaneously to define a new feature.
| Dimension | Filter Behavior | Purpose |
|---|---|---|
| Width (W) | Slides (Stride) | Locates features horizontally. |
| Height (H) | Slides (Stride) | Locates features vertically. |
| Depth (D) | Fixed (Full Depth) | Combines channel data into a new feature. |
1. Spatial Dimensions (Height and Width)
When you slide a filter over an image, the output size naturally shrinks, and the pixels on the extreme borders are not treated as fairly as the pixels in the center. To prevent the image from shrinking too quickly, we use padding, which means we artificially add a border of zero-value pixels around the outside of the input image.
Your original image is 15 × 15 in the spatial dimensions. With a padding of 1 on every side, the padded input becomes 17 × 17.
- Padding adds a border of zeros around the image so that the 3 × 3 filter can still fit at the edges without shrinking the output.
The problem also mentions a stride of 1. Stride simply dictates how many steps the filter takes when it slides across the image.
To calculate the exact size of the output, your professor provided this formula:

Output size = (N − F + 2P) / S + 1

where N is the input size, F is the filter size, P is the padding, and S is the stride.

Let's plug your numbers into this formula for both height and width:

(15 − 3 + 2 × 1) / 1 + 1 = 14 + 1 = 15

Notice how using a padding of 1 with a 3 × 3 filter and a stride of 1 keeps the spatial size unchanged: the output is still 15 × 15.
2. Depth (Number of Channels)
Every time you slide a single filter across the entire image, it generates one completely independent 2D output slice. Because your layer uses 28 filters, the network will generate 28 completely independent output slices and stack them together.
In other words: Each filter produces one feature map. Since there are 28 filters, the output depth will be: 28
Final Answer for Part 1: Combining the height, width, and depth, the final output feature map volume is 15 × 15 × 28.
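The formula above can be sanity-checked in plain Python (the function name is illustrative, not from the starter code):

```python
def conv_output_size(n, f, p, s):
    """Spatial output size: (N - F + 2P) / S + 1."""
    return (n - f + 2 * p) // s + 1

height = conv_output_size(15, 3, 1, 1)  # input 15, filter 3, padding 1, stride 1
width = conv_output_size(15, 3, 1, 1)
depth = 28  # one feature map per filter
print((height, width, depth))  # (15, 15, 28)
```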
Part 2: What is the number of parameters in this layer (including bias)?
A "parameter" is a specific weight or number that the network has to learn. To find the total number of parameters in this layer, we need to calculate the parameters for just one filter, and then multiply that by the total number of filters.
1. Parameters in a single filter Color images are not just flat grids; they have depth because they are made of 3 color channels (Red, Green, Blue). Therefore, a filter must also have depth so it can connect to every input channel at the same time. This is why your filter size is given as 3 × 3 × 3.
To find the number of weights in one filter, you multiply those dimensions:
- Weights per filter = 3 × 3 × 3 = 27 weights.
Additionally, every single filter in a neural network gets exactly 1 bias parameter added to it, which acts as a threshold.
- Total parameters per filter = 27 + 1 = 28 parameters.
2. Total parameters in the layer You have 28 of these filters in total. Because each output slice is generated completely independently, every single filter has its own unique set of parameters. For this convolutional layer we have:
- Total parameters = 28 filters × 28 parameters per filter.
- Total parameters = 784.
Final Answer for Part 2: There are 784 trainable parameters in this layer (including biases).
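A quick arithmetic check of the parameter count (illustrative script, not assignment code):

```python
weights_per_filter = 3 * 3 * 3               # filter volume: 27 weights
params_per_filter = weights_per_filter + 1   # +1 bias per filter: 28
num_filters = 28
total_params = params_per_filter * num_filters
print(total_params)  # 784
```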
Question 2
In this question, you can (a) assume padding of appropriate size is used in convolutional layers, and (b) ignore batch normalization. Given the residual block as below:
![Residual block diagram](/CSCE-421/HWs/Visual%20Aids/image.png)
1. Skip connection
What projection shortcut operations are required on the skip connection?
This question introduces one of the most famous breakthroughs in modern deep learning: the Residual Network (ResNet).
The Basics: Feature Maps and Stride
- Feature Maps: When you pass an image through a convolutional filter, the output is called a feature map. If a layer has 256 filters, it will output 256 independent feature maps stacked together (this is the "depth" of the output volume).
- Stride: Stride dictates how many pixels the filter jumps when it slides across the image. A standard stride is 1. If you use a stride of 2, the filter jumps by 2 pixels at a time. This effectively skips every other pixel, which cuts the spatial dimensions (height and width) of the feature map exactly in half.
What is a Residual Block and a Skip Connection?
In traditional neural networks, data flows straight down a single path, layer by layer. However, researchers found that if you make a network too deep (e.g., 56 layers), the performance actually gets worse because it becomes too mathematically difficult to optimize.
To fix this, researchers invented the Residual Block. Instead of forcing the network to learn a completely new transformation from scratch at every layer, they added a second path called a skip connection (the long arrow going down the right side of your diagram).
- The Main Path (Left): Runs the data through standard convolutional filters to extract new, complex patterns.
- The Skip Connection (Right): Takes the original input and "skips" it past the convolutions, copying it directly to the bottom of the block.
At the very bottom of the block, you hit the "add" operation. The network takes the newly transformed data from the left path and adds it element-by-element to the original data from the right path.
The Problem: The "Add" Crash
Here is the golden rule of residual blocks: To add two volumes of data together, they must have the exact same spatial size (height and width) and the exact same number of feature maps (depth).
Let's track the data through your specific diagram to see why this rule creates a massive problem:
- The Input: You start with 128 feature maps. Let's pretend their spatial size is 32×32.
- The Main Path (Left):
- The first convolution uses 256 filters, changing the depth from 128 to 256 feature maps.
- It also uses a stride of 2, which cuts our 32×32 spatial size in half, shrinking it to 16×16.
- The second convolution has a stride of 1, so it keeps the size at 16×16 and the depth at 256 feature maps.
- The Skip Connection (Right):
- This path just copies the original input. So, it brings down data that has 128 feature maps and a spatial size of 32×32.
The Crash: At the "add" node, the network tries to add a 16 × 16 × 256 block (from the left) to a 32 × 32 × 128 block (from the right). Because the dimensions do not match at all, the math crashes.
The Solution: The Projection Shortcut
To fix this crash, we cannot just blindly copy the input down the skip connection. We have to apply a quick operation to the skip connection to force its dimensions to perfectly match the main path. This is what we call a "projection shortcut."
We need to fix two things on the skip connection:
- Fixing the Depth: We need to increase the number of feature maps from 128 to 256. To do this without messing up the actual image patterns, we use a 1x1 convolution with 256 filters. A 1x1 convolution is specifically used to change the number of feature maps.
- Fixing the Spatial Size: We need to shrink the height and width by a factor of 2. We do this by applying a stride of 2 to that same 1x1 convolution.
By putting a 1x1 convolution with a stride of 2 on the skip connection, the right path will output a 16 × 16 × 256 block. Now, both paths match perfectly, and the "add" operation will succeed!
Mathematically
In a Residual Network, the standard formula for a block is y = F(x) + x. When the dimensions of F(x) and x differ, the identity shortcut is replaced by a projection: y = F(x) + W_s x.
Here is a mathematical way to write your answer, incorporating the tensor dimensions and the output size formula from your notes:
Let the input to the residual block be the tensor x of shape 32 × 32 × 128 (using the spatial size assumed above). The main path produces F(x) of shape 16 × 16 × 256.
To perform the element-wise residual addition y = F(x) + W_s x, the shortcut output must also have shape 16 × 16 × 256. Two adjustments are required:
- Depth (Channel Expansion): We map the input channels to the target channels by applying a 1 × 1 convolution parameterized by weights W_s. This performs a linear projection that increases the feature maps from 128 to 256.
- Spatial Size (Downsampling): We apply a stride of S = 2 to this convolution. Using the spatial dimension formula (N − F + 2P)/S + 1 and substituting a filter size of F = 1, padding P = 0, and stride S = 2 yields an output size of (32 − 1)/2 + 1 = 16 (with integer division). This mathematically halves the height and width to exactly match the spatial dimensions of the main path.
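The shape bookkeeping above can be verified with the same output-size formula (the 32 × 32 input size is the assumed example from the walkthrough, not given in the problem):

```python
def conv_out(n, f, p, s):
    # (N - F + 2P) // S + 1
    return (n - f + 2 * p) // s + 1

h = 32  # assumed input spatial size

# main path: 3x3 conv (padding 1, stride 2), then 3x3 conv (padding 1, stride 1)
main_h = conv_out(conv_out(h, 3, 1, 2), 3, 1, 1)

# projection shortcut: 1x1 conv (padding 0, stride 2)
skip_h = conv_out(h, 1, 0, 2)

print(main_h, skip_h)  # both 16 — the "add" now works
```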
2. Number of trainable parameters
What is the total number of trainable parameters in the block (you can ignore bias terms, but need to consider the skip connection)?
The Core Formula for Trainable Parameters
In #Question 1, we learned that the trainable parameters are the actual numbers (weights) inside the filters that the computer has to learn.
- First, we need to know the number of weights in one single filter. A filter looks at a 2D patch (Height × Width) and reaches all the way through the depth of the input data (Input Channels). So, weights in one filter = Height × Width × Input Channels.
- Second, we multiply that by the total number of filters used in the layer (which is equal to the number of Output Channels/Feature Maps).
Since the problem explicitly says we can ignore the bias terms, our final mathematical formula is simply: Parameters = (Filter Height × Filter Width × Input Feature Maps) × Output Feature Maps
(Note: You will notice that stride is not in this formula. Stride only changes how the filter moves across the image; it does not change the physical size of the filter itself, so it does not affect the number of parameters!)
Now, let's apply this formula to the three distinct convolutional layers in your residual block.
Step 1: The First Convolution on the Main Path
Looking at the left side of your diagram, the data first passes through a 3 × 3 convolution with 256 filters and a stride of 2.
- Filter Height × Width: 3 × 3
- Input Feature Maps: The data coming into this block has 128 feature maps.
- Output Feature Maps (Number of Filters): The layer creates 256 feature maps.
Let's plug this into our formula:
- Weights per filter = 3 × 3 × 128 = 1,152
- Total parameters = 1,152 × 256 = 294,912 parameters.
Step 2: The Second Convolution on the Main Path
The data continues down the left side into the second 3 × 3 convolution, which has 256 filters and a stride of 1.
- Filter Height × Width: 3 × 3
- Input Feature Maps: Pay close attention here! The input to this specific layer is the output of the previous layer, which produced 256 feature maps.
- Output Feature Maps (Number of Filters): This layer also creates 256 feature maps.
Let's plug this into our formula:
- Weights per filter = 3 × 3 × 256 = 2,304
- Total parameters = 2,304 × 256 = 589,824 parameters.
Step 3: The Skip Connection (Projection Shortcut)
Remember from the previous question that the skip connection (the right path) cannot just be an empty wire. Because the dimensions didn't match, we had to add a 1 × 1 convolution with 256 filters and a stride of 2.
- Filter Height × Width: 1 × 1
- Input Feature Maps: This path splits off at the very top of the block, where the original input has 128 feature maps.
- Output Feature Maps (Number of Filters): We established we need 256 feature maps to match the main path.
Let's plug this into our formula:
- Weights per filter = 1 × 1 × 128 = 128
- Total parameters = 128 × 256 = 32,768 parameters.
Step 4: The Final Total
To find the total number of trainable parameters in the entire residual block, we simply add the parameters from all three of these convolutions together:
- First Main Conv: 294,912
- Second Main Conv: 589,824
- Skip Connection Conv: 32,768
- Total = 294,912 + 589,824 + 32,768 = 917,504
Final Answer Summary:
The total number of trainable parameters in this residual block is 917,504. We calculate this by summing the parameters of the three convolutional operations (using the formula Height × Width × Input Depth × Output Filters without bias):
- First 3 × 3 layer: 3 × 3 × 128 × 256 = 294,912
- Second 3 × 3 layer: 3 × 3 × 256 × 256 = 589,824
- Skip Connection (1 × 1 layer): 1 × 1 × 128 × 256 = 32,768
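A quick script to double-check the three products and the total (the helper name is illustrative):

```python
def conv_params(fh, fw, c_in, c_out):
    # weights only: filter height x width x input channels x output channels
    # (bias terms ignored, as the problem allows)
    return fh * fw * c_in * c_out

main1 = conv_params(3, 3, 128, 256)   # first main-path conv
main2 = conv_params(3, 3, 256, 256)   # second main-path conv
skip = conv_params(1, 1, 128, 256)    # 1x1 projection shortcut
total = main1 + main2 + skip
print(main1, main2, skip, total)  # 294912 589824 32768 917504
```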
Question 3
Using batch normalization in neural networks requires computing the mean and variance of a tensor. Suppose a batch normalization layer takes vectors z^(1), ..., z^(m) as input, where m is the mini-batch size. The layer first computes the normalized vectors
ẑ^(i) = (z^(i) − μ) / √(σ² + ε),
where μ and σ² are the empirical mean and variance of the mini-batch, computed independently for each dimension.
It then applies a second transformation to obtain
z̄^(i) = γ ⊙ ẑ^(i) + β,
where γ (scale) and β (shift) are learned parameters.
In this question, you can assume that ε = 0.
Part 1
- (5 points) You forward-propagate a mini-batch of m = 4 examples in your network. Suppose you are at a batch normalization layer, where the immediately previous layer is a fully connected layer with 3 units. Therefore, the input to this batch normalization layer can be represented as the below matrix:
What are μ, σ², and the normalized matrix Ẑ?
1. The Concepts: What is Batch Normalization?
Batch Normalization is considered one of the most important breakthroughs in Deep Learning because it allows networks to train much faster and more stably.
The Problem: As data passes through the many layers of a neural network, the scale of the numbers can get messy. One node (unit) might output values in the thousands, while another node outputs decimals. If the numbers are on completely different scales, the network struggles to learn, and the learning process can oscillate or diverge.
The Solution: To fix this, we force the outputs of the layers to be on a standard, predictable scale. Specifically, we want the data coming out of each unit to have a mean (average) of 0 and a variance (spread) of 1. Your professor’s notes describe this perfectly: "you want zero-mean unit-variance activations? just make them so".
How it works (Mini-Batches & Dimensions):
- Mini-batch (m): Instead of passing the entire dataset through the network at once, we pass it in small chunks called mini-batches. In your homework, your mini-batch has 4 examples (e.g., 4 images).
- Units (Rows): The layer right before this Batch Normalization step has 3 units. This means for each of the 4 examples, the network generates 3 numbers.
- The Matrix: In your 3 × 4 matrix, each column represents one of the 4 examples. Each row represents one of the 3 units in the network.
The Golden Rule of Batch Norm: Batch Normalization is calculated independently for each dimension (unit). This means you do not calculate the average of the whole matrix. You calculate the average and variance for Row 1 independently, then Row 2 independently, and then Row 3 independently.
2. Why Batch Normalization?
Why do we use Mini-Batches instead of the entire dataset?
To understand this, we need to quickly review how a neural network learns. The network looks at the data, makes a prediction, calculates how wrong it is (the error), and then updates its internal weights to be more accurate next time.
If you have a dataset of 1 million images, there are two main reasons you cannot simply feed all of them to the network at once:
1. Hardware and Memory Limits (The physical reason) The most straightforward reason we cannot pass the entire dataset at once is that computers simply do not have enough memory to hold it. When a neural network processes data, it has to store all the intermediate math calculations for every single image in the computer's graphics card (GPU) memory.
- In theory, a larger batch size is better, but you are strictly limited by your GPU memory.
- Most GPUs only have around 10 GB to 18 GB of memory. If you try to pass 1 million high-resolution images at the exact same time, the computer will immediately crash. Therefore, we are forced to divide the data into smaller, manageable chunks called mini-batches.
2. Learning Speed (The mathematical reason) If you use Batch Gradient Descent (passing the entire dataset at once), the network will process all 1 million images, calculate the total error, and then take a single update step. This means your network spent a massive amount of computational power just to learn one single thing.
- By using Mini-Batch Gradient Descent, we divide the dataset into small chunks (for example, batches of 32 images).
- The network looks at the first 32 images and updates its weights. Then it looks at the next 32 images and updates its weights again.
- By the time it finishes looking at the entire dataset, it has updated its weights thousands of times instead of just once. This allows the network to learn much faster and achieve a balanced, moderate convergence speed.
Why is Batch Normalization calculated independently for each dimension (unit)?
To understand this, we need to think about what the numbers coming out of those units actually represent.
1. The Danger of Mixed Scales
Imagine a network trying to predict heart attacks. The data passing through the network contains entirely different types of features: one unit might process a person's age (numbers in the tens), while another processes their salary (numbers in the hundreds of thousands).
- If we just let these raw numbers flow into the network, the math gets severely distorted. Large-scale features (like the salary) will completely dominate the network's updates, while small-scale features (like the age) will get drowned out and oscillate or fail to learn properly.
2. The Goal: A "Similar Pace" for Everything
To fix this, we want to force every single piece of data to speak the exact same mathematical language. The goal of Batch Normalization is to guarantee that the data coming out of every single unit has a mean (average) of exactly 0 and a variance (spread) of exactly 1. This ensures that all features update at a "similar pace".
3. Why it MUST be calculated independently
If we took the entire layer of units and calculated one giant average for all of them combined, the large-scale features would dominate the statistics: the shared mean and variance would be meaningless for the small-scale features, and after "normalizing," every unit would still sit on a different scale.
Therefore, the only way to ensure every feature is on a level playing field is to compute the empirical mean and variance independently for each dimension (unit).
- By calculating the math independently for the "salary" unit, all the salaries are neatly squashed to an average of 0 and a variance of 1.
- By calculating the math independently for the "age" unit, all the ages are neatly squashed to an average of 0 and a variance of 1.
- Now, when the data moves to the next layer of the neural network, every single unit is outputting numbers on the exact same 0-centered scale, allowing the network to train smoothly and efficiently without any single feature dominating the others.
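The per-row (per-unit) normalization can be sketched in numpy. The matrix below is hypothetical, chosen only to mimic the mixed scales discussed above; it is not the assignment's matrix:

```python
import numpy as np

# hypothetical 3x4 activations: rows are units, columns are examples
Z = np.array([[1.0, 3.0, 3.0, 1.0],           # small-scale ("age") unit
              [10.0, 30.0, 30.0, 10.0],       # medium-scale unit
              [100.0, 300.0, 300.0, 100.0]])  # large-scale ("salary") unit

mu = Z.mean(axis=1, keepdims=True)   # one mean per row (unit)
var = Z.var(axis=1, keepdims=True)   # one variance per row (unit)
Z_hat = (Z - mu) / np.sqrt(var)      # epsilon taken as 0, as in this question

print(Z_hat)  # every row lands on the same standardized scale: [-1, 1, 1, -1]
```

Despite the three rows differing by two orders of magnitude, each normalized row comes out identical, which is exactly the "level playing field" described above.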
3. Solving the Math Step-by-Step
Let's apply the formulas provided in your question to each row individually.
Unit 1 (Row 1)
The raw signals for the first unit are the four entries of Row 1 of the input matrix.
Step A: Calculate the Mean (μ₁): average the four entries of the row.
Step B: Calculate the Variance (σ₁²): average the squared deviations of those entries from μ₁.
Step C: Normalize: subtract μ₁ from each entry and divide by the standard deviation √σ₁².
Unit 2 (Row 2)
Repeat the same three steps using only Row 2: compute its mean μ₂ and variance σ₂², then normalize as (Row 2 − μ₂)/√σ₂².
Unit 3 (Row 3)
Repeat once more using only Row 3: compute μ₃ and σ₃², then normalize as (Row 3 − μ₃)/√σ₃². Each unit's statistics come exclusively from its own row.
Final Answer
By stacking our normalized rows back together, the final normalized matrix is
\widehat{Z}=\left[\begin{array}{cccc}
-1 & 1 & 1 & -1 \\
-1 & 1 & 1 & -1 \\
-1 & 1 & 1 & -1
\end{array}\right]
(Notice how, despite starting with completely different ranges of numbers in the original matrix, the Batch Normalization successfully squashed every single row into the exact same standardized scale!)
The Vector Approach (The Professor's Formula)
In your professor's formula, the vectors z^(i) are the columns of the matrix: z^(1) is column 1, z^(2) is column 2, z^(3) is column 3, and z^(4) is column 4.
Look closely at the professor's formula for the mean: μ = (1/m) Σᵢ z^(i).
Because you are adding vectors together, you must follow the rules of linear algebra, which dictate that vector addition is performed element-wise (row by row).
Let's plug the vectors into the professor's formula: when you add those columns together, you add the top-row entries together, the middle-row entries together, and the bottom-row entries together.
Notice what just happened! The resulting mean vector μ has one entry per row:
- its first entry is the mean of Unit 1
- its second entry is the mean of Unit 2
- its third entry is the mean of Unit 3
Why did I explain it row-by-row?
I broke it down row-by-row because your professor's notes explicitly state the golden rule of Batch Normalization: "compute the empirical mean and variance independently for each dimension".
If you look at later slides, the professor actually expands the notation to show this explicitly by using two indices: one index represents the batch example (the columns), and the other represents the dimension/unit (the rows).
Summary
- When the formula shows z^(i), it is taking an entire column (all features of one example).
- Because μ and σ² are calculated by adding those columns together element-wise, the linear algebra naturally computes the averages straight across the rows.
- Therefore, the single vector formula μ = (1/m) Σᵢ z^(i) is just a shorthand way of saying "do this for every unit independently," which is exactly what we did!
The Vectorized Approach: Computation
Instead of calculating row-by-row, we can use the formal vector definitions of Batch Normalization, treating each example in the mini-batch as a full column vector. Operations like addition, subtraction, squaring, and division are performed element-wise.
Step 1: Define the Input Vectors: z^(1), ..., z^(4) are the four columns of the input matrix.
Step 2: Calculate the Mean Vector: μ = (1/4) Σᵢ z^(i).
Step 3: Calculate the Variance Vector: σ² = (1/4) Σᵢ (z^(i) − μ)², with the square applied element-wise.
Step 4: Normalize Each Vector: ẑ^(i) = (z^(i) − μ) / √σ² (recall ε = 0).
Step 5: Reconstruct the Final Matrix: place the four normalized columns side by side to form Ẑ.
Part 2
Continue with the above setting. Suppose
Now that you have successfully normalized the data (giving it a mean of 0 and a variance of 1), the layer applies a second, learned transformation: z̄^(i) = γ ⊙ ẑ^(i) + β.
Here is the step-by-step explanation of why we do this and how to solve the math.
1. The Concept: Why Scale and Shift?
In the previous question, we squashed the output of every single unit so that it perfectly centered around 0 with a variance of 1. But a rigid zero-mean, unit-variance scale is not always the best scale for the next layer to work with.
To fix this, the creators of Batch Normalization added a clever trick: after we normalize the data, we give the network the power to scale and shift the data into whatever range it actually needs.
- γ (Gamma): The learned scaling factor (it stretches or squashes the numbers).
- β (Beta): The learned shifting factor (it moves the numbers up or down).
If the network decides that the strict zero-mean, unit-variance scale is hurting performance, it can learn γ and β values that adjust, or even completely undo, the normalization (setting γ = √σ² and β = μ would recover the original activations exactly).
2. Breaking Down the Parameters
Just like the mean and variance, the scaling (γ) and shifting (β) parameters are applied independently for each dimension (each unit/row).
The problem states that γ and β each contain one value per unit:
- Unit 1 (Row 1): Scale by γ₁, Shift by β₁
- Unit 2 (Row 2): Scale by γ₂, Shift by β₂
- Unit 3 (Row 3): Scale by γ₃, Shift by β₃
3. Solving the Math Step-by-Step
Let's take the normalized matrix Ẑ from Part 1 and apply z̄ = γ ⊙ ẑ + β to each row:
\widehat{Z}=\left[\begin{array}{llll}
-1 & 1 & 1 & -1 \\
-1 & 1 & 1 & -1 \\
-1 & 1 & 1 & -1
\end{array}\right]
\bar{Z}=\left[\begin{array}{cccc}
-1 & 1 & 1 & -1 \\
-11 & -9 & -9 & -11 \\
9 & 11 & 11 & 9
\end{array}\right]
\left[\begin{array}{cccc}
12 & 14 & 14 & 12 \\
0 & 10 & 10 & 0 \\
-5 & 5 & 5 & -5
\end{array}\right]
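The scale-and-shift step is just broadcasting in numpy. The γ and β values below are placeholders, not the assignment's values; substitute your own:

```python
import numpy as np

Z_hat = np.array([[-1.0, 1.0, 1.0, -1.0],
                  [-1.0, 1.0, 1.0, -1.0],
                  [-1.0, 1.0, 1.0, -1.0]])

# hypothetical per-unit parameters (one gamma and one beta per row)
gamma = np.array([[2.0], [1.0], [3.0]])
beta = np.array([[5.0], [0.0], [-1.0]])

# broadcasting applies each row's own gamma and beta across that row
Z_bar = gamma * Z_hat + beta
print(Z_bar)
```

Shaping γ and β as 3 × 1 columns is what makes numpy apply each unit's parameters to its own row only, mirroring the row-by-row hand calculation.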
- During Training: The μ and σ² are calculated live using the other examples in the current batch. Therefore, the batch size and the specific images inside the batch physically alter the math.
- During Testing: The μ and σ² are hardcoded constants. The network pulls the global running averages from its memory (saved from the training phase) and plugs them into the equation.
Because those statistics are frozen at test time, the normalization applied to a sample is identical no matter what batch it arrives in.
3. The "What If" Scenario (Why must it be this way?)
To solidify this for your understanding, imagine what would happen if the testing batch size did affect the results.
Let's go back to our hospital X-Ray example:
- Imagine the hospital groups 10 patient X-Rays together to send through the neural network as a single batch.
- If the network calculated the mean and variance using those 10 specific patients, the numbers would squish and stretch based on the health of the entire group.
- This means Patient A's diagnosis could change from "Healthy" to "Sick" entirely depending on whether Patient B (who happened to be in the same batch) was healthy or sick!
This violates a fundamental rule of machine learning: a model's prediction for a specific data point must depend only on that data point. By using fixed training statistics, we guarantee that the testing batch size has zero influence on the prediction.
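A minimal sketch of this train-vs-test behavior (a toy class, not PyTorch's actual BatchNorm implementation): during training it updates running statistics; at test time it normalizes with those frozen statistics, so a sample's output cannot depend on its batch-mates:

```python
import numpy as np

class SimpleBatchNorm:
    """Toy per-unit batch norm with running averages for test time."""
    def __init__(self, num_units, momentum=0.9, eps=1e-5):
        self.running_mean = np.zeros((num_units, 1))
        self.running_var = np.ones((num_units, 1))
        self.momentum = momentum
        self.eps = eps

    def forward(self, Z, training):
        if training:
            # live statistics from the current batch
            mu = Z.mean(axis=1, keepdims=True)
            var = Z.var(axis=1, keepdims=True)
            # update the stored running averages
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            # test time: frozen constants; the batch contents are irrelevant
            mu, var = self.running_mean, self.running_var
        return (Z - mu) / np.sqrt(var + self.eps)

bn = SimpleBatchNorm(num_units=1)
bn.forward(np.array([[1.0, 3.0]]), training=True)  # one training step

alone = bn.forward(np.array([[2.0]]), training=False)          # batch of 1
grouped = bn.forward(np.array([[2.0, 50.0]]), training=False)  # batch of 2
print(alone[0, 0] == grouped[0, 0])  # the 50.0 "neighbor" changes nothing
```

"Patient A" gets the same normalized value whether tested alone or next to an extreme "Patient B," which is precisely the guarantee discussed above.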
Final Answer
Here is a clear and professional way you can summarize this for your assignment:
"The batch size during testing does not affect the testing results or predictions for any individual sample. During the testing phase, the Batch Normalization layer no longer computes the mean and variance from the incoming batch data. Instead, it uses the fixed running averages of the mean and variance that were pre-computed and saved during the training phase. Because the normalization parameters (mean, variance, γ, and β) are fixed constants at test time, each sample is normalized using only its own values and those constants, so its prediction is independent of the batch size and of the other samples in the batch."
4. LeNet for Image Recognition
In this coding assignment, you will need to complete the implementation of LeNet (LeCun Network) using Pytorch and apply the LeNet to the image recognition task on Cifar-10 (10-class classification). The Cifar-10 dataset is available here (https://www.cs.toronto.edu/~kriz/cifar.html). In addition, you will need to install the python packages “tqdm” and “pytorch”. The installation guides of PyTorch are in “readme.txt”. Please read carefully and follow the instructions. You are expected to implement your solution based on the given code. The only file you need to modify is the “solution.py” file. You can test your solution by running the “main.py” file.
Part 1
Download and extract the Cifar10 Dataset from the link above. Put the data folder “cifar-10-batches-py” in the same directory as “code”. Read the instructions carefully and then complete the function load_data().
1. Understanding the CIFAR-10 Data Format
Before we write the code, we need to know how the CIFAR-10 creators saved the dataset. According to the dataset documentation in your sources:
- The Files: The training data is split into 5 files named `data_batch_1` through `data_batch_5`. The testing data is in a single file called `test_batch`.
- The Format: These files are not standard image files (like `.png` or `.jpg`). They are Python "pickled" objects, which is a way to save Python dictionaries directly to a file.
- The Dictionary Contents: When you "unpickle" (open) one of these files, you get a dictionary containing:
  - `data`: A numpy array of size 10000 x 3072. These are the actual pixels.
  - `labels`: A list of 10,000 numbers (0-9) that represent the true class of each image.
2. The Reshape Math (Why 3072 -> 3 x 32 x 32?)
A standard color image has a Height, a Width, and 3 Color Channels (Red, Green, Blue).
- CIFAR-10 images are 32x32 pixels.
- Because they are color, they have 3 channels.
- Total numbers per image: 32 × 32 × 3 = 3072.
The creators of the dataset "flattened" these 3D images into a single 1D row of 3072 numbers to make them easier to save.
- The first 1024 numbers are the Red channel.
- The next 1024 numbers are the Green channel.
- The final 1024 numbers are the Blue channel.
To feed this data into a Convolutional Neural Network (LeNet), we cannot use a flat row. We must restore it to its 3D box shape: [Channels (3), Height (32), Width (32)].
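A tiny demonstration of why `reshape(3, 32, 32)` recovers the channel layout. The flat array here is synthetic, with a constant value per channel so the split is easy to verify:

```python
import numpy as np

# synthetic flat row standing in for one CIFAR-10 image:
# first 1024 values are the red channel, next 1024 green, last 1024 blue
flat = np.concatenate([np.zeros(1024),
                       np.ones(1024),
                       np.full(1024, 2)]).astype(np.uint8)

img = flat.reshape(3, 32, 32)  # restore [Channels, Height, Width]
print(img.shape)               # (3, 32, 32)
# each channel plane now holds exactly its own 1024 values
print(int(img[0, 0, 0]), int(img[1, 0, 0]), int(img[2, 0, 0]))  # 0 1 2
```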
3. Step-by-Step Implementation
Here is exactly how we will build your load_data function:
- Unpickling Helper: We will write a tiny function inside `load_data` to open the pickled files and read the dictionaries.
- Load Training Data: We will loop through numbers 1 to 5 to open `data_batch_1` through `data_batch_5`, extracting the pixels and labels from each and appending them to a list.
- Combine Arrays: We will use `numpy` to stitch those 5 separate batches together into one giant array of 50,000 images and 50,000 labels.
- Load Testing Data: We will open `test_batch` to get our 10,000 testing images and labels.
- Reshape & Type Cast: We will use the numpy `.reshape()` command to convert the flat 3072 arrays into the `(3, 32, 32)` format. We will also force the data types to match your assignment's requirements (`uint8` for images, `int64` for labels).
The Code
Here is the complete code you can copy and paste into your `solution.py` file inside the `load_data(data_dir)` block. Make sure you import `os`, `pickle`, and `numpy` at the very top of your `solution.py` file!
import os
import pickle
import numpy as np

def load_data(data_dir):
    '''
    To load the Cifar-10 Dataset from files and reshape the
    image arrays from shape [N, 3072] to shape [N, 3, 32, 32].
    ...
    '''
    ### YOUR CODE HERE
    # 1. Helper function to unpickle the files
    def unpickle(file):
        with open(file, 'rb') as fo:
            # encoding='bytes' ensures Python 3 reads the byte strings properly
            batch_dict = pickle.load(fo, encoding='bytes')
        return batch_dict

    # 2. Initialize empty lists to hold our batches of data
    x_train_list = []
    y_train_list = []

    # 3. Loop through training batches 1 to 5
    for i in range(1, 6):
        # Construct the file path (e.g., '../cifar-10-batches-py/data_batch_1')
        filename = os.path.join(data_dir, f'data_batch_{i}')
        # Unpickle the dictionary
        batch_dict = unpickle(filename)
        # The dictionary keys are stored as byte strings (b'data' and b'labels')
        x_train_list.append(batch_dict[b'data'])
        y_train_list += batch_dict[b'labels']  # use += to combine the lists

    # 4. Combine the 5 lists into single numpy arrays
    # vstack stacks the 5 (10000, 3072) arrays into one (50000, 3072) array
    x_train = np.vstack(x_train_list)
    # Convert the combined labels list into a numpy array of type int64
    y_train = np.array(y_train_list, dtype=np.int64)

    # 5. Load the test batch
    test_filename = os.path.join(data_dir, 'test_batch')
    test_dict = unpickle(test_filename)
    x_test = test_dict[b'data']
    y_test = np.array(test_dict[b'labels'], dtype=np.int64)

    # 6. Reshape the image data!
    # -1 tells numpy to automatically figure out the first dimension (50000 or 10000)
    # 3, 32, 32 separates the flat 3072 row into 3 channels of 32x32 grids.
    # .astype(np.uint8) ensures the pixel values are standard 8-bit unsigned integers.
    x_train = x_train.reshape(-1, 3, 32, 32).astype(np.uint8)
    x_test = x_test.reshape(-1, 3, 32, 32).astype(np.uint8)
    ### END YOUR CODE

    return x_train, y_train, x_test, y_test
A quick note on why b'data' instead of 'data'
Because the CIFAR-10 dataset is a bit older and was originally pickled in Python 2, loading it in modern Python 3 causes the string keys in the dictionary to be read as "byte strings". In Python, you write a byte string by putting a b in front of the quotes. So, we ask the dictionary for b'data' instead of 'data'.
Let me know when you get this successfully running in your main.py and are ready to tackle the LeNet model itself!
Running the code:
> python3 main.py
Loading and preprocessing...
/Users/macc/TAMU/Spring_26/CSCE-421/HW3/code/solution.py:37: VisibleDeprecationWarning: dtype(): align should be passed as Python or NumPy boolean but got `align=0`. Did you mean to pass a tuple to create a subarray type? (Deprecated NumPy 2.4)
batch_dict = pickle.load(fo, encoding='bytes')
Why is this happening? The message you are seeing is a VisibleDeprecationWarning, which is just a heads-up from the system, not an actual error that stops your code. The CIFAR-10 dataset files were originally created and saved (using Python's cPickle module) many years ago using a much older version of the NumPy library. You are currently using a very modern version of NumPy (version 2.4).
When your modern NumPy reads the old pickled file using pickle.load(fo, encoding='bytes'), it notices that the arrays were saved using an outdated formatting parameter (align=0). It prints this warning to let you know the format is deprecated, but it is still perfectly capable of reading the data. The data arrays loaded into your computer's memory are exactly what they need to be.
You can safely ignore this warning and move right along! Let me know what the next part of your assignment asks you to do to start building the LeNet model.
Part 2
Complete the function preprocessing(), you need to implement two ways of preprocessing the data: (1) rescaling: rescales the image pixels from range 0-255 to range 0-1; (2) normalization: centralize and rescale each image using its mean and variance. There is a parameter normalize in the preprocessing() to control whether to rescale or normalize the images.
Data preprocessing is a crucial step because neural networks learn much faster and more stably when the input numbers are small and on a standardized scale.
1. The Goal of the Function
The docstring tells us we have two arrays: train_images and test_images. Currently, they contain pixel values ranging from 0 to 255, and their data type is uint8 (8-bit integers). The docstring also specifies that our function must return these arrays as float64 (decimals). Therefore, the very first thing we must do is convert the data type of both arrays.
2. Method 1: Rescaling (normalize=False)
If the user passes normalize=False, we just need to squash the 0 to 255 range down to a 0 to 1 range. Because the maximum pixel value is 255, we can achieve this perfectly by simply dividing every single number in the arrays by 255.0.
3. Method 2: Normalization (normalize=True)
If the user passes normalize=True, we have to do something a bit more mathematically complex. We need to centralize the data (give it a mean of 0) and rescale it using variance (give it a spread of 1).
To do this, we must remember a critical rule from your notes:
- The Rule: "You want your data to be subtracted and divided by the exact same numbers for training and test".
- Why? If you calculate a separate mean for the test set, the prediction for one test image will mathematically change depending on what other test images happen to be in the set with it. This is a massive mistake in machine learning.
- The Solution: We will calculate the mean and standard deviation (which is the square root of variance) using only the training images. Then, we will use those exact same training numbers to normalize both the train_images and the test_images.
The Python Implementation
Here is the code you can copy and paste into your solution.py file to complete the preprocess function:
import numpy as np
def preprocess(train_images, test_images, normalize=False):
'''
To preprocess the data by
(1).Rescaling the pixels from integers in [0,255) to
floats in [0,1), or
(2).Normalizing each image using its mean and variance.
...
'''
### YOUR CODE HERE
# 1. Cast the arrays to float64 as required by the docstring's Returns section
train_images = train_images.astype(np.float64)
test_images = test_images.astype(np.float64)
if not normalize:
# (1) Rescaling: Squeeze the range down to
train_images = train_images / 255.0
test_images = test_images / 255.0
else:
# (2) Normalization: Centralize (subtract mean) and rescale (divide by std)
# Calculate the mean and standard deviation using ONLY the training data.
# axis=(0, 2, 3) calculates these statistics per color channel (R, G, B)
# across all images (axis 0), heights (axis 2), and widths (axis 3).
# keepdims=True ensures the shape aligns perfectly for subtraction/division later.
train_mean = np.mean(train_images, axis=(0, 2, 3), keepdims=True)
train_std = np.std(train_images, axis=(0, 2, 3), keepdims=True)
# Prevent division by zero just in case a channel is completely blank
train_std = np.maximum(train_std, 1e-7)
# Apply the EXACT SAME training statistics to normalize BOTH datasets
train_images = (train_images - train_mean) / train_std
test_images = (test_images - train_mean) / train_std
### END YOUR CODE
return train_images, test_images
A Quick Note on the Math
You might notice the prompt asks to rescale using "variance", but the code uses np.std (Standard Deviation). In statistics and deep learning, the standard formula to normalize data to a unit variance is to divide by the standard deviation (which is simply the square root of variance). Dividing directly by the raw variance is mathematically incorrect for this goal, so np.std is the proper NumPy tool to use here!
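A quick numerical check of that claim, using synthetic data (not the CIFAR-10 arrays) drawn with a known spread of 3:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=100_000)

by_std = (x - x.mean()) / x.std()   # correct: yields unit variance
by_var = (x - x.mean()) / x.var()   # incorrect for this goal

print(round(by_std.std(), 2))  # 1.0
print(round(by_var.std(), 2))  # ~0.33, i.e. 1/std -- not unit variance
```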
Let me know if the axis=(0, 2, 3) part makes sense or if you are ready to move on to building the actual LeNet model!
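To make the axis=(0, 2, 3) part concrete, here is a sketch on a hypothetical mini-dataset (4 images, 3 channels, 8×8 pixels — made-up sizes, not CIFAR-10):

```python
import numpy as np

# Hypothetical mini-dataset: 4 images, 3 color channels, 8x8 pixels.
x = np.random.default_rng(1).normal(size=(4, 3, 8, 8))

# Averaging over axes 0 (images), 2 (height), and 3 (width) leaves one
# statistic per color channel; keepdims=True keeps the result broadcastable.
mean = x.mean(axis=(0, 2, 3), keepdims=True)
print(mean.shape)  # (1, 3, 1, 1)

# After subtracting, each channel's mean is (numerically) zero.
print(np.allclose((x - mean).mean(axis=(0, 2, 3)), 0.0))  # True
```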
Part 3
Complete the class LeNet().
In particular, you need to complete functions __init__() and forward() in the class. The paper for LeNet can be found here (https://ieeexplore.ieee.org/document/726791) The network architecture is shown in the figure below (LeNet.png).
The subsampling is implemented using max pooling, and the kernel size for all the convolutional layers is 5 × 5. The sequential layers are:
For this part, you are only allowed to use the APIs in torch.nn. Please refer to the PyTorch API documents below for the usage of those APIs before you use them: https://pytorch.org/docs/stable/nn.html
Building your first Neural Network from scratch is a major milestone! To implement the LeNet class, we need to map your professor's exact layer sequence to PyTorch functions inside the __init__ method, and then pass the data through them in the forward method.
Before we write the code, we must solve one critical mathematical puzzle: What is the size of the vector when we "Reshape to vector"?
1. The Math: Tracking the Tensor Shape
Just like we did in Question 1 of your homework, we need to track the size of our images as they pass through the convolutional layers using the formula:

Output size = ((input size + 2 × padding − kernel size) / stride) + 1

- Inputs: The docstring says the input is [None, 3, 32, 32]. So we start with 3 channels, Height 32, Width 32.
- First Convolution: 6 filters, 5 × 5 kernel, stride 1 (default), no padding. Size: ((32 − 5) / 1) + 1 = 28. New Volume: 6 × 28 × 28.
- First Max Pooling: Standard max pooling uses a 2 × 2 window and a stride of 2, which cuts the image size exactly in half. Size: 28 / 2 = 14. New Volume: 6 × 14 × 14.
- Second Convolution: 16 filters, 5 × 5 kernel. Size: ((14 − 5) / 1) + 1 = 10. New Volume: 16 × 10 × 10.
- Second Max Pooling: 2 × 2 window, stride 2. Size: 10 / 2 = 5. New Volume: 16 × 5 × 5.
- The Reshape Step: At this point, we have 16 feature maps, each 5 × 5. Vector size: 16 × 5 × 5 = 400. (This means our first Linear layer must take exactly 400 inputs!)
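The shape tracking above can be sketched as a few lines of plain Python (a quick check, not part of the assignment code):

```python
def conv_out(size, kernel, stride=1, padding=0):
    # Output size = (input + 2*padding - kernel) // stride + 1
    return (size + 2 * padding - kernel) // stride + 1

s = conv_out(32, 5)   # conv1 -> 28
s = s // 2            # pool1 -> 14
s = conv_out(s, 5)    # conv2 -> 10
s = s // 2            # pool2 -> 5
print(16 * s * s)     # 400: the in_features for the first Linear layer
```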
2. Implementing __init__() (Defining the Layers)
In the __init__ function, we just initialize the tools (layers) we are going to use.
- Conv2d: Takes (in_channels, out_channels, kernel_size).
- Linear: Takes (in_features, out_features).
- BatchNorm: Your notes specify a very important distinction here. We use BatchNorm2d for convolutional layers (because the data still has spatial Height and Width), but we use BatchNorm1d for fully-connected layers (because the data is now a flat 1D vector).
3. Implementing forward(x) (Passing the Data)
In the forward function, we take the input image x and pass it sequentially through the tools we defined in __init__.
To do the "Reshape to vector" step, we will use the PyTorch command x = x.view(x.size(0), -1).
- x.size(0) is the batch size (the number of images). We never want to flatten images into each other!
- -1 is a PyTorch shortcut that means "flatten all the remaining dimensions into a single row".
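The same flatten can be seen with a NumPy analogue (a sketch with zeros standing in for real activations; numpy's reshape plays the role of PyTorch's view here):

```python
import numpy as np

# Hypothetical batch shaped like LeNet's pool2 output: 8 images of 16x5x5.
x = np.zeros((8, 16, 5, 5))

# NumPy analogue of PyTorch's x.view(x.size(0), -1): keep the batch
# dimension, flatten everything else into one 400-length row per image.
flat = x.reshape(x.shape[0], -1)
print(flat.shape)  # (8, 400)
```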
The Final Code
Here is the fully completed solution.py class implementation. You can replace your LeNet skeleton with this:
import torch
import torch.nn as nn
class LeNet(nn.Module):
'''
Build the LeCun network according to the architecture in the homework part 4(c)
'''
def __init__(self, n_classes=None):
super(LeNet, self).__init__()
### YOUR CODE HERE
# --- BLOCK 1: Convolutional Layers ---
# 1. Convolution (3 in channels for RGB -> 6 out channels, 5x5 kernel)
self.conv1 = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)
self.bn1 = nn.BatchNorm2d(6)
self.relu1 = nn.ReLU()
self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
# 2. Convolution (6 in channels -> 16 out channels, 5x5 kernel)
self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5)
self.bn2 = nn.BatchNorm2d(16)
self.relu2 = nn.ReLU()
self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
# --- BLOCK 2: Fully-Connected Layers ---
# 3. Fully-connected (16*5*5 = 400 inputs -> 120 out units)
self.fc1 = nn.Linear(in_features=16*5*5, out_features=120)
self.bn3 = nn.BatchNorm1d(120)
self.relu3 = nn.ReLU()
# 4. Fully-connected (120 inputs -> 84 out units)
self.fc2 = nn.Linear(in_features=120, out_features=84)
self.bn4 = nn.BatchNorm1d(84)
self.relu4 = nn.ReLU()
self.dropout = nn.Dropout(p=0.5) # 0.5 is the standard dropout probability
# 5. Outputs (84 inputs -> n_classes out units)
self.fc3 = nn.Linear(in_features=84, out_features=n_classes)
### END CODE HERE
def forward(self, x):
'''
Run forward pass of the model defined in the above __init__() function
'''
### YOUR CODE HERE
# Pass through Block 1
x = self.conv1(x)
x = self.bn1(x)
x = self.relu1(x)
x = self.pool1(x)
x = self.conv2(x)
x = self.bn2(x)
x = self.relu2(x)
x = self.pool2(x)
# Reshape to vector (Flatten the 16x5x5 feature maps into 400-length vectors)
# x.size(0) keeps the batch dimension separate.
x = x.view(x.size(0), -1)
# Pass through Block 2
x = self.fc1(x)
x = self.bn3(x)
x = self.relu3(x)
x = self.fc2(x)
x = self.bn4(x)
x = self.relu4(x)
x = self.dropout(x)
# Final Output (Logits)
logits = self.fc3(x)
### END CODE HERE
return logits
One last detail: Notice that the very last layer fc3 just returns the raw, un-normalized values (called "logits"). It does not apply the Softmax activation function. Why? Because in PyTorch, the standard loss function for multi-class classification (CrossEntropyLoss) actually calculates the Softmax math internally for better numerical stability! Therefore, we just return the raw logits.
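To see what "calculates the Softmax math internally" means, here is a small numerically-stable sketch in plain NumPy — the same idea CrossEntropyLoss uses, though not PyTorch's actual implementation:

```python
import numpy as np

def cross_entropy_from_logits(logits, target):
    # Stable log-softmax: subtracting the max before exp() avoids overflow,
    # which is why the loss wants raw logits rather than softmax outputs.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

loss = cross_entropy_from_logits(np.array([2.0, 1.0, 0.1]), target=0)
print(round(loss, 3))  # 0.417
```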
Part 4
Try to read and understand the class LeNet_Cifar10(). Run the main.py to train and test the model. You need to train two models: One using normalization preprocessing and the default LeNet architecture as defined in part (c), and another one using rescaling preprocessing and LeNet without Dropout and Batch Normalization. Compare the results of the two models and analyze why there is an improvement in (or why it harms) the performance. Include your training log and test accuracy, as well as a short analysis of the results in your report. The training and testing may take 10-30 minutes to finish on a CPU (20 epochs).
Congratulations on making it to the final task! You have built the components, and now it is time to put them all together, run the training loop, and see the deep learning magic happen.
I will first explain what the LeNet_Cifar10 class is doing so you understand the code, then guide you on how to run both models, and finally provide the detailed theoretical analysis you need for your report based on your professor's notes.
1. Understanding the LeNet_Cifar10 Class
This class is a "wrapper" that manages the actual training process. Here is what it does step-by-step:
- The Optimizer & Loss: In __init__, it sets up the CrossEntropyLoss (which mathematically expects the raw logits your LeNet outputs) and the Adam optimizer. The optimizer is the tool that performs the "Gradient Descent" step you learned about, updating the weights using a learning rate of 0.001.
- Mini-Batch SGD: Inside train(), it divides your 50,000 training images into mini-batches (num_batches). It shuffles the data at the start of every epoch, grabs a batch of images (x_batch), and passes them to the model.
- The Golden Rule in Code (model.train() vs model.eval()): Notice that at the start of the epoch, it calls self.model.train(). Later, before doing validation or testing, it calls self.model.eval(). This is exactly what we discussed earlier! PyTorch uses these commands to tell the Batch Normalization and Dropout layers whether to use the live mini-batch statistics (Training phase) or the saved global running averages (Testing/Evaluation phase).
- The Backprop: loss.backward() calculates the gradient (the sensitivities) for every weight, and optimizer.step() actually moves the weights down the hill.
2. How to Run Model 1 (The Default Model)
This is your "good" model with all the advanced deep learning tricks.
- Open main.py and make sure the preprocess line is set to use Normalization: x_train, x_test = preprocess(x_train, x_test, normalize=True)
- Make sure your LeNet class in solution.py is exactly as we wrote it earlier (with all the BatchNorm and Dropout layers).
- Run python3 main.py.
- Action Required: Copy the output printed in your terminal (the Epoch losses and Validation Accuracies) and the final Test Accuracy. Save this in your report!
3. How to Run Model 2 (The Stripped-Down Model)
This is the "barebones" model that relies purely on convolutions and linear layers without the modern tricks.
- Open main.py and change the preprocess line to use Rescaling instead: x_train, x_test = preprocess(x_train, x_test, normalize=False)
- Open solution.py and comment out the Batch Normalization and Dropout layers in your LeNet class. Your modified forward function should look like this:

    def forward(self, x):
        # Block 1
        x = self.conv1(x)
        # x = self.bn1(x)  <-- COMMENTED OUT
        x = self.relu1(x)
        x = self.pool1(x)
        x = self.conv2(x)
        # x = self.bn2(x)  <-- COMMENTED OUT
        x = self.relu2(x)
        x = self.pool2(x)
        x = x.view(x.size(0), -1)
        # Block 2
        x = self.fc1(x)
        # x = self.bn3(x)  <-- COMMENTED OUT
        x = self.relu3(x)
        x = self.fc2(x)
        # x = self.bn4(x)  <-- COMMENTED OUT
        x = self.relu4(x)
        # x = self.dropout(x)  <-- COMMENTED OUT
        logits = self.fc3(x)
        return logits

- Run python3 main.py again.
- Action Required: Save the new training log and final test accuracy for your report.
4. Analysis for your Report (Why Model 1 performs better)
Once your code finishes running, you will notice that Model 1 converges faster, achieves higher validation accuracy, and has a much better test accuracy than Model 2.
Here is the detailed analysis you can adapt for your report to explain exactly why this happens, drawing directly from your course concepts:
1. The Impact of Data Normalization vs. Rescaling In Model 2, the data was only rescaled to a range of 0 to 1. In Model 1, the data was fully normalized using the training set's mean and variance. As noted in the course, un-normalized inputs cause large features to dominate the weight updates, while small features get drowned out. By forcing the input data to have a zero-mean and unit-variance, Model 1 ensures that all weights update at a "similar pace", preventing the gradient descent from oscillating wildly and allowing the network to learn stable patterns right from the first layer.
2. The Impact of Batch Normalization Model 2 lacks Batch Normalization. As data passes through multiple layers of convolutions and ReLUs, the scale of the activations can shift dramatically, making it very hard for the optimizer to find a good path down the error surface. By adding Batch Normalization in Model 1, we force the activations at every layer to maintain a zero-mean and unit-variance. This "improves gradient flow through the network," reduces the model's dependence on lucky weight initialization, and allows the Adam optimizer to take confident, stable steps. Without it, Model 2 likely struggled to optimize its weights efficiently.
3. The Impact of Dropout (Combating Overfitting) Model 2 has no regularization, meaning it is highly susceptible to overfitting. Because the network has thousands of parameters, it likely started to memorize the training data (fitting the "stochastic and deterministic noise") rather than learning generalizable features. Model 1 utilizes Dropout (with a probability of 0.5) on its fully connected layer. By randomly turning off 50% of the neurons during every training pass, Dropout prevents neurons from "co-adapting" and forces the network to learn robust, redundant representations of the images. This acts as a massive ensemble of different networks, heavily regularizing the model and ensuring that the high training accuracy actually translates into high test accuracy.
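The dropout mechanism described above can be sketched in a few lines of NumPy (an "inverted dropout" illustration, not PyTorch's nn.Dropout source):

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.ones(10)
p = 0.5

# Training: zero each activation with probability p, then scale survivors
# by 1/(1-p) ("inverted dropout") so the expected activation is unchanged.
mask = rng.random(x.shape) >= p
train_out = x * mask / (1 - p)
print(train_out)  # a mix of 0.0 and 2.0 entries

# Evaluation: dropout does nothing (part of what model.eval() toggles).
eval_out = x
```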
Conclusion Summary for Report: Model 2 suffers from slower/unstable optimization (due to lack of input and batch normalization) and poor generalization (due to lack of Dropout regularization). Model 1 utilizes normalization to create a smooth, efficient gradient descent path, and relies on Dropout to combat overfitting, ultimately yielding a vastly superior out-of-sample (Test) accuracy.
# convolutional layers
x = self.conv1(x)
x = self.bn1(x)
x = self.relu1(x)
x = self.pool1(x)
x = self.conv2(x)
x = self.bn2(x)
x = self.relu2(x)
x = self.pool2(x)
# Reshape to vector
x = x.view(x.size(0), -1)
# fully-connected layers
x = self.fc1(x)
x = self.bn3(x)
x = self.relu3(x)
x = self.fc2(x)
x = self.bn4(x)
x = self.relu4(x)
x = self.dropout(x)
# Final output
x = self.fc3(x)
# convolutional layers
x = self.conv1(x)
# (removed batch normalization)
x = self.relu1(x)
x = self.pool1(x)
x = self.conv2(x)
# (removed batch normalization)
x = self.relu2(x)
x = self.pool2(x)
# Reshape to vector
x = x.view(x.size(0), -1)
# fully-connected layers
x = self.fc1(x)
# (removed batch normalization)
x = self.relu3(x)
x = self.fc2(x)
# (removed batch normalization)
x = self.relu4(x)
# (removed dropout)
# Final output
x = self.fc3(x)
Looking at the results, Model 1 definitely outperformed the stripped-down Model 2, scoring a test accuracy of about 65.9% compared to 61.8%, while also hitting a much lower training loss. This makes a lot of sense when you think about the deep learning tricks we added. First off, using full data normalization gives the inputs a zero-mean and unit-variance, which helps all the weights update at a similar pace instead of having small weights oscillate while large weights dominate the updates. Then we have Batch Normalization, which is a game-changer because it improves the gradient flow through the network and makes the whole optimization process much more stable. That perfectly explains why our training loss dropped so much lower. On top of that, Model 1 used Dropout to randomly shut off neurons during training, preventing them from co-adapting. This basically acts like training a huge ensemble of models, forcing the network to learn redundant, robust features instead of just memorizing the training data. So, between the smoother learning path from the normalization steps and the heavy regularization from Dropout, it's no wonder Model 1 generalized so much better to the unseen test images!
Review (Answers)
Question 1
Output size = ((input size + 2 × padding − kernel size) / stride) + 1 = ((15 + 2 × 1 − 3) / 1) + 1 = 15, so the output feature map volume is 15 × 15 × 28.
Number of parameters = 3 × 3 × 3 × 28 weights + 28 biases = 784.
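Plugging the Question 1 numbers into the formula as a quick arithmetic check:

```python
# 15x15x3 input, 28 filters of 3x3x3, padding 1, stride 1.
out = (15 + 2 * 1 - 3) // 1 + 1
print(out)                    # 15 -> output volume is 15 x 15 x 28
print(3 * 3 * 3 * 28 + 28)    # 784 parameters, including the 28 biases
```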
Question 2
Total trainable parameters = 294,912 + 589,824 + 32,768 = 917,504
Question 3
- During training, batch normalization computes the mean and variance within each mini-batch
- During testing, it uses the moving mean and moving variance computed during training.
- Batch size does not affect testing results because the mean and variance used at test time are fixed values calculated during training.
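The train/test distinction can be sketched with a minimal 1-D batch-norm class (an illustration of the idea, not PyTorch's BatchNorm implementation; momentum and eps values are assumptions matching common defaults):

```python
import numpy as np

class TinyBatchNorm:
    """Minimal 1-D batch-norm sketch showing what train()/eval() toggle."""
    def __init__(self, momentum=0.1, eps=1e-5):
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps
        self.training = True

    def __call__(self, x):
        if self.training:
            # Training: use this mini-batch's own statistics, and fold
            # them into the running averages used later at test time.
            mean, var = x.mean(), x.var()
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Testing: use the fixed running averages, so one image's
            # output no longer depends on what else shares its batch.
            mean, var = self.running_mean, self.running_var
        return (x - mean) / np.sqrt(var + self.eps)

bn = TinyBatchNorm()
out_train = bn(np.array([1.0, 2.0, 3.0]))   # live batch statistics
bn.training = False
out_eval = bn(np.array([1.0, 2.0, 3.0]))    # frozen running statistics
```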
Question 4
You should be able to get a test accuracy of ~65% on Cifar-10
- Are there learnable parameters in pooling layers?
- No, it just performs an operation to take the maximum value
- Are there learnable parameters in ReLU layers?
- No, it is just an operation
- Are there learnable parameters in BN layers?
- Yes, there are some extra learnable parameters (gamma, and beta)
- This gives you a way to learn how to modify normalization a little bit
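These per-layer parameter counts can be checked by hand (a sketch using the LeNet sizes from Part 3; pooling and ReLU contribute zero learnable parameters):

```python
def conv2d_params(c_in, c_out, k):
    return c_in * c_out * k * k + c_out   # weights plus one bias per filter

def batchnorm_params(channels):
    return 2 * channels                   # one gamma and one beta per channel

print(conv2d_params(3, 6, 5))   # conv1 of LeNet: 3*6*25 + 6 = 456
print(batchnorm_params(6))      # bn1 of LeNet: 12
```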