HW3 - Convolutional Networks

#MachineLearning #Convolution #NeuralNetworks

Notes:

Question 1

A single (15 × 15 × 3) image is passed through a convolutional layer with 28 filters, each of size (3 × 3 × 3). The padding size is 1 (1 unit at top, bottom, left, and right) and the stride size is also 1. What is the size of the output feature map volume? What is the number of parameters in this layer (including bias)? Note that, for simplicity, we consider filters of 3 × 3 × 3 as one filter, instead of three filters of size 3 × 3.

What is a Convolution?

To understand how Convolutional Neural Networks (CNNs) work from scratch, we first need to understand how a computer sees an image and how we can extract patterns from it.

To a computer, an image is simply a two-dimensional array of numbers. In a CNN, we use something called a filter (sometimes called a kernel), which is a much smaller box of numbers. The core idea of a convolution is to take this small filter and slide it across the input image. At every step, the network performs an element-wise multiplication between the numbers in the filter and the numbers in that specific patch of the image, and then sums them all up to produce a single number.

These filters are designed to detect important visual features, like vertical or horizontal edges, which combine to form shapes and objects. The numbers inside these filters are the parameters that the neural network actually learns from the data during training.

Part 1: What is the size of the output feature map volume?

To find the volume of the output, we need to calculate its spatial dimensions (Height and Width) and its depth (Number of output channels/slices).

You are given a single image of size:

15 \times 15 \times 15

This means:

height = 15
width = 15
depth = 3

The layer has 28 filters, each of size:

3 \times 3 \times 3

That means:

height = 3
width = 3
depth = 3

This is important:

A filter must match the full input depth (Filters always extend to the full depth of the input volume), so since the input depth is 3, each filter must also have depth 3.
By having the filter extend through the full depth, the network can learn correlations between channels. A filter can decide, for example, that a feature is only present if there is a high value in the Red channel but a low value in the Blue channel.
When the filter sits on a spot in the input, it performs an element-wise multiplication across every single cell in its 3D volume.
- If your input is $32 \times 32 \times 3$ (RGB) and your filter is $5 \times 5$ , the filter must be $5 \times 5 \times 3$ .
- All $75$ multiplications ( $5 \times 5 \times 3$ ) are performed, and then all of them are summed together into a single number (plus a bias).
Because these values are summed into one scalar for that specific location, the "depth" is collapsed. This is why a single filter always produces a 2D output map, regardless of how deep the input was.

Spatial vs. Depth Dimensions

Width and Height (Spatial): We slide the filter across these because we want to find the same feature (like an eye or a bolt) no matter where it appears in the image (Translation Invariance).
Depth (Channel): We do not slide the filter through the depth because the channels represent different types of information about the same pixels. We want to process all that information simultaneously to define a new feature.

Dimension	Filter Behavior	Purpose
Width ( $W$ )	Slides (Stride)	Locates features horizontally.
Height ( $H$ )	Slides (Stride)	Locates features vertically.
Depth ( $D$ )	Fixed (Full Depth)	Combines channel data into a new feature.

1. Spatial Dimensions (Height and Width)

When you slide a filter over an image, the output size naturally shrinks, and the pixels on the extreme borders are not treated as fairly as the pixels in the center. To prevent the image from shrinking too quickly, we use padding, which means we artificially add a border of zero-value pixels around the outside of the input image.

Your original image is $15 \times 15$ . Because the problem states there is a padding of 1, you are adding 1 pixel to the top, 1 to the bottom, 1 to the left, and 1 to the right. This makes your "effective" input size $15 + 1 + 1 = 17$ for both height and width.

Padding adds a border around the image so that the 3×3 filter can still fit at the edges without shrinking the output.

The problem also mentions a stride of 1. Stride simply dictates how many steps the filter takes when it slides across the image.

To calculate the exact size of the output, your professor provided this formula:

Output Size = \frac{Input Size - Filter Size}{Stride} + 1

Let's plug your numbers into this formula for both height and width:

Output Dimension = \frac{17 - 3}{1} + 1 = 15

Notice how using a padding of 1 with a $3 \times 3$ filter perfectly preserved your original $15 \times 15$ image size.

2. Depth (Number of Channels)

Every time you slide a single filter across the entire image, it generates one completely independent 2D output slice. Because your layer uses 28 filters, the network will generate 28 completely independent output slices and stack them together.

In other words: Each filter produces one feature map. Since there are 28 filters, the output depth will be: 28

Final Answer for Part 1: Combining the height, width, and depth, the final output feature map volume is
$15 \times 15 \times 28$

Part 2: What is the number of parameters in this layer (including bias)?

A "parameter" is a specific weight or number that the network has to learn. To find the total number of parameters in this layer, we need to calculate the parameters for just one filter, and then multiply that by the total number of filters.

1. Parameters in a single filter Color images are not just flat grids; they have depth because they are made of 3 color channels (Red, Green, Blue). Therefore, a filter must also have depth so it can connect to every input channel at the same time. This is why your filter size is given as $3 \times 3 \times 3$ (Height $\times$ Width $\times$ Input Channels).

To find the number of weights in one filter, you multiply those dimensions:

Weights per filter = $3 \times 3 \times 3 = 27$ weights.

Additionally, every single filter in a neural network gets exactly 1 bias parameter added to it, which acts as a threshold.

Total parameters per filter = $27 weights + 1 bias = 28$ parameters.

2. Total parameters in the layer You have 28 of these filters in total. Because each output slice is generated completely independently, every single filter has its own unique set of parameters. For this convolutional layer we have:

Total parameters = $28 filters \times 28 parameters per filter$ .
Total parameters = $784$ .

Final Answer for Part 2: There are $784$ parameters in this convolutional layer.

Question 2

In this question, you can (a) assume padding of appropriate size is used in convolutional layers, and (b) ignore batch normalization. Given the residual block as below:

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/HWs/Visual Aids/image.png320

1. Skip connection

What projection shortcut operations are required on the skip connection?

This question introduces one of the most famous breakthroughs in modern deep learning: the Residual Network (ResNet).

The Basics: Feature Maps and Stride

Feature Maps: When you pass an image through a convolutional filter, the output is called a feature map. If a layer has 256 filters, it will output 256 independent feature maps stacked together (this is the "depth" of the output volume).
Stride: Stride dictates how many pixels the filter jumps when it slides across the image. A standard stride is 1. If you use a stride of 2, the filter jumps by 2 pixels at a time. This effectively skips every other pixel, which cuts the spatial dimensions (height and width) of the feature map exactly in half

What is a Residual Block and a Skip Connection?

In traditional neural networks, data flows straight down a single path, layer by layer. However, researchers found that if you make a network too deep (e.g., 56 layers), the performance actually gets worse because it becomes too mathematically difficult to optimize

To fix this, researchers invented the Residual Block. Instead of forcing the network to learn a completely new transformation from scratch at every layer, they added a second path called a skip connection (the long arrow going down the right side of your diagram)

The Main Path (Left): Runs the data through standard convolutional filters to extract new, complex patterns.
The Skip Connection (Right): Takes the original input and "skips" it past the convolutions, copying it directly to the bottom of the block.

At the very bottom of the block, you hit the "add" operation. The network takes the newly transformed data from the left path and adds it element-by-element to the original data from the right path.

The Problem: The "Add" Crash

Here is the golden rule of residual blocks: To add two volumes of data together, they must have the exact same spatial size (height and width) and the exact same number of feature maps (depth).

Let's track the data through your specific diagram to see why this rule creates a massive problem:

The Input: You start with 128 feature maps. Let's pretend their spatial size is 32×32.
The Main Path (Left):
- The first convolution uses 256 filters, changing the depth from 128 to 256 feature maps.
- It also uses a stride of 2, which cuts our 32×32 spatial size in half, shrinking it to 16×16.
- The second convolution has a stride of 1, so it keeps the size at 16×16 and the depth at 256 feature maps.
The Skip Connection (Right):
- This path just copies the original input. So, it brings down data that has 128 feature maps and a spatial size of 32×32.

The Crash: At the "add" node, the network tries to add a 16 × 16 × 256 block (from the left) to a 32 × 32 × 128 block (from the right). Because the dimensions do not match at all, the math crashes.

The Solution: The Projection Shortcut

To fix this crash, we cannot just blindly copy the input down the skip connection. We have to apply a quick operation to the skip connection to force its dimensions to perfectly match the main path. This is what we call a "projection shortcut."

We need to fix two things on the skip connection:

Fixing the Depth: We need to increase the number of feature maps from 128 to 256. To do this without messing up the actual image patterns, we use a 1x1 convolution with 256 filters. A 1x1 convolution is specifically used to change the number of feature maps.
Fixing the Spatial Size: We need to shrink the height and width by a factor of 2. We do this by applying a stride of 2 to that same 1x1 convolution.

By putting a 1x1 convolution with a stride of 2 on the skip connection, the right path will output a 16 × 16 × 256 block. Now, both paths match perfectly, and the "add" operation will succeed!.

Mathematically

In a Residual Network, the standard formula for a block is $y = F (x) + x$ . However, when the dimensions of the main path $F (x)$ and the input $x$ do not match, we must apply a linear projection $W_{s}$ to the shortcut so that $y = F (x) + W_{s} (x)$ .

Here is a mathematical way to write your answer using LaTeX, which incorporates the tensor dimensions and the output size formula from your notes:

Let the input to the residual block be the tensor $x \in R^{H \times W \times 128}$ . The main convolutional path downsamples the spatial dimensions and increases the filters, yielding an output $F (x) \in R^{\frac{H}{2} \times \frac{W}{2} \times 256}$ .

To perform the element-wise residual addition $y = F (x) + W_{s} (x)$ , the projection shortcut $W_{s}$ must transform $x$ to match the exact dimensions of $F (x)$ .

Depth (Channel Expansion): We map the input channels to the target channels by applying a $1 \times 1$ convolution parameterized by weights $W_{s} \in R^{1 \times 1 \times 128 \times 256}$ . This performs a linear projection that increases the feature maps from $128$ to $256$ .
Spatial Size (Downsampling): We apply a stride of $S = 2$ to this $1 \times 1$ convolution. Using the spatial dimension formula $Output Size = ⌊ \frac{N - F}{S} ⌋ + 1$ , substituting a filter size of $F = 1$ and stride $S = 2$ yields an output size of $⌊ \frac{N - 1}{2} ⌋ + 1 = \frac{N}{2}$ . This mathematically halves the height and width to exactly match the $\frac{H}{2} \times \frac{W}{2}$ spatial dimensions of the main path.

2. Number of trainable parameters

What is the total number of trainable parameters in the block (you can ignore bias terms, but need to consider the skip connection)?

The Core Formula for Trainable Parameters

In #Question 1, we learned that the trainable parameters are the actual numbers (weights) inside the filters that the computer has to learn.

First, we need to know the number of weights in one single filter. A filter looks at a 2D patch (Height × Width) and reaches all the way through the depth of the input data (Input Channels). So, weights in one filter = Height × Width × Input Channels.
Second, we multiply that by the total number of filters used in the layer (which is equal to the number of Output Channels/Feature Maps).

Since the problem explicitly says we can ignore the bias terms, our final mathematical formula is simply: Parameters = (Filter Height × Filter Width × Input Feature Maps) × Output Feature Maps

(Note: You will notice that stride is not in this formula. Stride only changes how the filter moves across the image; it does not change the physical size of the filter itself, so it does not affect the number of parameters!)

Now, let's apply this formula to the three distinct convolutional layers in your residual block.

Step 1: The First Convolution on the Main Path

Looking at the left side of your diagram, the data first passes through a $3 \times 3$ convolution to create 256 feature maps.

Filter Height $\times$ Width: $3 \times 3$
Input Feature Maps: The data coming into this block has 128 feature maps.
Output Feature Maps (Number of Filters): The layer creates 256 feature maps.

Let's plug this into our formula:

Weights per filter = $3 \times 3 \times 128 = 1, 152$
Total parameters = $1, 152 weights \times 256 filters =$ $294, 912$ parameters.

Step 2: The Second Convolution on the Main Path

The data continues down the left side into the second $3 \times 3$ convolution.

Filter Height $\times$ Width: $3 \times 3$
Input Feature Maps: Pay close attention here! The input to this specific layer is the output of the previous layer. The previous layer output 256 feature maps.
Output Feature Maps (Number of Filters): This layer also creates 256 feature maps.

Let's plug this into our formula:

Weights per filter = $3 \times 3 \times 256 = 2, 304$
Total parameters = $2, 304 weights \times 256 filters =$ $589, 824$ parameters.

Step 3: The Skip Connection (Projection Shortcut)

Remember from the previous question that the skip connection (the right path) cannot just be an empty wire. Because the dimensions didn't match, we had to add a $1 \times 1$ convolution to it to increase the feature maps from 128 to 256. These $1 \times 1$ filters also contain learnable parameters!

Filter Height $\times$ Width: $1 \times 1$
Input Feature Maps: This path splits off at the very top of the block, where the original input has 128 feature maps.
Output Feature Maps (Number of Filters): We established we need 256 feature maps to match the main path.

Let's plug this into our formula:

Weights per filter = $1 \times 1 \times 128 = 128$
Total parameters = $128 weights \times 256 filters =$ $32, 768$ parameters.

Step 4: The Final Total

To find the total number of trainable parameters in the entire residual block, we simply add the parameters from all three of these convolutions together:

First Main Conv: $294, 912$
Second Main Conv: $589, 824$
Skip Connection Conv: $32, 768$
Total = $294, 912 + 589, 824 + 32, 768 = 917, 504$

Final Answer Summary:

The total number of trainable parameters in this residual block is 917,504. We calculate this by summing the parameters of the three convolutional operations (using the formula Height × Width × Input Depth × Output Filters without bias):

First $3 \times 3$ layer: $(3 \times 3 \times 128) \times 256 = 294, 912$
Second $3 \times 3$ layer: $(3 \times 3 \times 256) \times 256 = 589, 824$
Skip Connection ( $1 \times 1$ layer): $(1 \times 1 \times 128) \times 256 = 32, 768$

Question 3

Using batch normalization in neural networks requires computing the mean and variance of a tensor. Suppose a batch normalization layer takes vectors $z_{1}, z_{2}, \dots, z_{m}$ as input, where $m$ is the mini-batch size. It computes ${\hat{z}}_{1}, {\hat{z}}_{2}, \dots, {\hat{z}}_{m}$ according to

{\hat{z}}_{i} = \frac{z_{i} - μ}{\sqrt{σ^{2} + ϵ}}

where

μ = \frac{1}{m} \sum_{i = 1}^{m} z_{i}, σ^{2} = \frac{1}{m} \sum_{i = 1}^{m} {(z_{i} - μ)}^{2} .

It then applies a second transformation to obtain ${\tilde{z}}_{1}, {\tilde{z}}_{2}, \dots, {\tilde{z}}_{m}$ using learned parameters $γ$ and $β$ as

{\tilde{z}}_{i} = γ {\hat{z}}_{i} + β .

In this question, you can assume that $ϵ = 0$ .

Part 1

(5 points) You forward-propagate a mini-batch of $m = 4$ examples in your network. Suppose you are at a batch normalization layer, where the immediately previous layer is a fully connected layer with 3 units. Therefore, the input to this batch normalization layer can be represented as the below matrix:

[\begin{array}{cccc} 12 & 14 & 14 & 12 \\ 0 & 10 & 10 & 0 \\ - 5 & 5 & 5 & - 5 \end{array}]

What are ${\hat{z}}_{i}$ ? Please express your answer in a $3 \times 4$ matrix.

1. The Concepts: What is Batch Normalization?

Batch Normalization is considered one of the most important breakthroughs in Deep Learning because it allows networks to train much faster and more stably.

The Problem: As data passes through the many layers of a neural network, the scale of the numbers can get messy. One node (unit) might output values in the thousands, while another node outputs decimals. If the numbers are on completely different scales, the network struggles to learn, and the learning process can oscillate or diverge.

The Solution: To fix this, we force the outputs of the layers to be on a standard, predictable scale. Specifically, we want the data coming out of each unit to have a mean (average) of 0 and a variance (spread) of 1. Your professor’s notes describe this perfectly: "you want zero-mean unit-variance activations? just make them so".

How it works (Mini-Batches & Dimensions):

Mini-batch ( $m = 4$ ): Instead of passing the entire dataset through the network at once, we pass it in small chunks called mini-batches. In your homework, your mini-batch has 4 examples (e.g., 4 images).
Units (Rows): The layer right before this Batch Normalization step has 3 units. This means for each of the 4 examples, the network generates 3 numbers.
The Matrix: In your $3 \times 4$ matrix, each column represents one of the 4 examples. Each row represents one of the 3 units in the network.

The Golden Rule of Batch Norm: Batch Normalization is calculated independently for each dimension (unit). This means you do not calculate the average of the whole matrix. You calculate the average and variance for Row 1 independently, then Row 2 independently, and then Row 3 independently.

2. Why Batch Normalization?

Why do we use Mini-Batches instead of the entire dataset?

To understand this, we need to quickly review how a neural network learns. The network looks at the data, makes a prediction, calculates how wrong it is (the error), and then updates its internal weights to be more accurate next time.

If you have a dataset of 1 million images, you have three choices for how to feed that data to the network:

1. Hardware and Memory Limits (The physical reason) The most straightforward reason we cannot pass the entire dataset at once is that computers simply do not have enough memory to hold it. When a neural network processes data, it has to store all the intermediate math calculations for every single image in the computer's graphics card (GPU) memory.

In theory, a larger batch size is better, but you are strictly limited by your GPU memory.
Most GPUs only have around 10 GB to 18 GB of memory. If you try to pass 1 million high-resolution images at the exact same time, the computer will immediately crash. Therefore, we are forced to divide the data into smaller, manageable chunks called mini-batches.

2. Learning Speed (The mathematical reason) If you use Batch Gradient Descent (passing the entire dataset at once), the network will process all 1 million images, calculate the total error, and then take a single update step. This means your network spent a massive amount of computational power just to learn one single thing.

By using Mini-Batch Gradient Descent, we divide the dataset into small chunks (for example, batches of 32 images).
The network looks at the first 32 images and updates its weights. Then it looks at the next 32 images and updates its weights again.
By the time it finishes looking at the entire dataset, it has updated its weights thousands of times instead of just once. This allows the network to learn much faster and achieve a balanced, moderate convergence speed.

Why is Batch Normalization calculated independently for each dimension (unit)?

To understand this, we need to think about what the numbers coming out of those units actually represent.

1. The Danger of Mixed Scales
Imagine a network trying to predict heart attacks. The data passing through the network contains entirely different types of features: one unit might process a person's age (e.g., $62$ ), while another unit processes their annual salary (e.g., $40, 000$ ).

If we just let these raw numbers flow into the network, the math gets severely distorted. Large weights (like the $40, 000$ salary) will completely dominate the network's updates, while small weights (like the $62$ age) will get drowned out and oscillate or fail to learn properly.

2. The Goal: A "Similar Pace" for Everything
To fix this, we want to force every single piece of data to speak the exact same mathematical language. The goal of Batch Normalization is to guarantee that the data coming out of every single unit has a mean (average) of exactly 0 and a variance (spread) of exactly 1. This ensures that all features update at a "similar pace".

3. Why it MUST be calculated independently
If we took the entire layer of units and calculated one giant average for all of them combined, the $40, 000$ salary numbers would drag the average way up. If we then subtracted that giant average from the age unit, a $62$ -year-old would suddenly be represented by a massive negative number! The scales would still be completely ruined.

Therefore, the only way to ensure every feature is on a level playing field is to compute the empirical mean and variance independently for each dimension (unit).

By calculating the math independently for the "salary" unit, all the salaries are neatly squashed to an average of $0$ .
By calculating the math independently for the "age" unit, all the ages are neatly squashed to an average of $0$ .
Now, when the data moves to the next layer of the neural network, every single unit is outputting numbers on the exact same $0$ -centered scale, allowing the network to train smoothly and efficiently without any single feature dominating the others.

3. Solving the Math Step-by-Step

Let's apply the formulas provided in your question to each row individually.

Unit 1 (Row 1)

The raw signals for the first unit across the 4 examples are: $z^{(1)} =$

[\begin{array}{cccc} 12 & 14 & 14 & 12 \end{array}]

Step A: Calculate the Mean ( $μ$ ) Add them up and divide by $m = 4$ .

$μ_{1} = \frac{12 + 14 + 14 + 12}{4} = \frac{52}{4} = 13$

Step B: Calculate the Variance ( $σ^{2}$ ) Subtract the mean from each number, square the result, and average them.

$σ_{1}^{2} = \frac{(12 - 13)^{2} + (14 - 13)^{2} + (14 - 13)^{2} + (12 - 13)^{2}}{4}$
$σ_{1}^{2} = \frac{(- 1)^{2} + (1)^{2} + (1)^{2} + (- 1)^{2}}{4} = \frac{1 + 1 + 1 + 1}{4} = \frac{4}{4} = 1$

Step C: Normalize ( $\hat{z}$ ) Subtract the mean and divide by the square root of the variance (which is the standard deviation). Note: The problem says to assume $ϵ = 0$ .

Standard deviation = $\sqrt{1} = 1$
${\hat{z}}^{(1)} = \frac{- 13}{1} = [- 1, 1, 1, - 1]$

Unit 2 (Row 2)

The raw signals for the second unit are: $z^{(2)} =$ .

[\begin{array}{cccc} 0 & 10 & 10 & 0 \end{array}]

Step A: Calculate the Mean ( $μ$ )

$μ_{2} = \frac{0 + 10 + 10 + 0}{4} = \frac{20}{4} = 5$

Step B: Calculate the Variance ( $σ^{2}$ )

$σ_{2}^{2} = \frac{(0 - 5)^{2} + (10 - 5)^{2} + (10 - 5)^{2} + (0 - 5)^{2}}{4}$
$σ_{2}^{2} = \frac{(- 5)^{2} + (5)^{2} + (5)^{2} + (- 5)^{2}}{4} = \frac{25 + 25 + 25 + 25}{4} = \frac{100}{4} = 25$

Step C: Normalize ( $\hat{z}$ )

Standard deviation = $\sqrt{25} = 5$
${\hat{z}}^{(2)} = \frac{- 5}{5} = [\frac{- 5}{5}, \frac{5}{5}, \frac{5}{5}, \frac{- 5}{5}] = [- 1, 1, 1, - 1]$

Unit 3 (Row 3)

The raw signals for the third unit are: $z^{(3)} = [- 5, 5, 5, - 5]$ .

Step A: Calculate the Mean ( $μ$ )

$μ_{3} = \frac{- 5 + 5 + 5 - 5}{4} = \frac{0}{4} = 0$

Step B: Calculate the Variance ( $σ^{2}$ )

$σ_{3}^{2} = \frac{(- 5 - 0)^{2} + (5 - 0)^{2} + (5 - 0)^{2} + (- 5 - 0)^{2}}{4}$
$σ_{3}^{2} = \frac{25 + 25 + 25 + 25}{4} = \frac{100}{4} = 25$

Step C: Normalize ( $\hat{z}$ )

Standard deviation = $\sqrt{25} = 5$
${\hat{z}}^{(3)} = \frac{[- 5, 5, 5, - 5] - 0}{5} = [\frac{- 5}{5}, \frac{5}{5}, \frac{5}{5}, \frac{- 5}{5}] = [- 1, 1, 1, - 1]$

📝 Final Answer

By stacking our normalized rows back together, the final normalized $3 \times 4$ matrix $\hat{Z}$ is:

\hat{Z} = [\begin{array}{cccc} - 1 & 1 & 1 & - 1 - 1 & 1 & 1 & - 1 - 1 & 1 & 1 & - 1 \end{array}]

(Notice how, despite starting with completely different ranges of numbers in the original matrix, the Batch Normalization successfully squashed every single row into the exact same standardized scale!)

The Vector Approach (The Professor's Formula)

In your professor's formula, the $z_{i}$ terms are column vectors. In your specific problem, they look like this:

$z_{1} = [\begin{matrix} 12 \\ 0 \\ - 5 \end{matrix}]$ , $z_{2} = [\begin{matrix} 14 \\ 10 \\ 5 \end{matrix}]$ , $z_{3} = [\begin{matrix} 14 \\ 10 \\ 5 \end{matrix}]$ , $z_{4} = [\begin{matrix} 12 \\ 0 \\ - 5 \end{matrix}]$

Look closely at the professor's formula for the mean: $μ = \frac{1}{m} \sum_{i = 1}^{m} z_{i}$ .
Because you are adding vectors together, you must follow the rules of linear algebra, which dictate that vector addition is performed element-wise (row by row).

Let's plug the vectors into the professor's formula:
$μ = \frac{1}{4} ([\begin{matrix} 12 \\ 0 \\ - 5 \end{matrix}] + [\begin{matrix} 14 \\ 10 \\ 5 \end{matrix}] + [\begin{matrix} 14 \\ 10 \\ 5 \end{matrix}] + [\begin{matrix} 12 \\ 0 \\ - 5 \end{matrix}])$

When you add those columns together, you add the top row together, the middle row together, and the bottom row together:
$μ = \frac{1}{4} [\begin{matrix} 12 + 14 + 14 + 12 \\ 0 + 10 + 10 + 0 \\ - 5 + 5 + 5 + - 5 \end{matrix}] = \frac{1}{4} [\begin{matrix} 52 \\ 20 \\ 0 \end{matrix}] = [\begin{matrix} 13 \\ 5 \\ 0 \end{matrix}]$

Notice what just happened! The resulting mean $μ$ is a vector containing exactly the three numbers we found when we calculated it row-by-row.

$13$ is the mean of Unit 1
$5$ is the mean of Unit 2
$0$ is the mean of Unit 3

Why did I explain it row-by-row?

I broke it down row-by-row because your professor's notes explicitly state the golden rule of Batch Normalization: "compute the empirical mean and variance independently for each dimension".

If you look at later slides, the professor actually expands the notation to show this explicitly by using two indices: ${\hat{x}}_{i, j} = \frac{x_{i, j} - μ_{j}}{\sqrt{σ_{j}^{2} + ϵ}}$ . In this expanded notation:

$i$ represents the batch example (the columns).
$j$ represents the dimension/unit (the rows).

Summary

When the formula shows $z_{i}$ , it is taking the entire column (all features of one example).
Because $μ$ and $σ^{2}$ are calculated by adding those columns together, the linear algebra naturally computes the averages straight across the rows.
Therefore, the single vector formula ${\hat{z}}_{i} = \frac{z_{i} - μ}{\sqrt{σ^{2} + ϵ}}$ is just a shorthand way of saying "do this for every unit independently," which is exactly what we did!

The Vectorized Approach: Computation

Instead of calculating row-by-row, we can use the formal vector definitions of Batch Normalization, treating each example in the mini-batch as a full column vector. Operations like addition, subtraction, squaring, and division are performed element-wise.

Step 1: Define the Input Vectors ( $z_{i}$ ) Separate the input matrix into $m = 4$ column vectors, where each vector represents one example in the mini-batch: $$z_1 = \begin{bmatrix} 12 \ 0 \ -5 \end{bmatrix}, \quad z_2 = \begin{bmatrix} 14 \ 10 \ 5 \end{bmatrix}, \quad z_3 = \begin{bmatrix} 14 \ 10 \ 5 \end{bmatrix}, \quad z_4 = \begin{bmatrix} 12 \ 0 \ -5 \end{bmatrix}$$

Step 2: Calculate the Mean Vector ( $μ$ ) Add all column vectors together and divide by $m$ : $$\mu = \frac{1}{4} (z_1 + z_2 + z_3 + z_4) = \frac{1}{4} \begin{bmatrix} 12+14+14+12 \ 0+10+10+0 \ -5+5+5-5 \end{bmatrix} = \begin{bmatrix} 13 \ 5 \ 0 \end{bmatrix}$$

Step 3: Calculate the Variance Vector ( $σ^{2}$ ) Subtract the mean vector $μ$ from each input vector, square the resulting elements, and average them: $$\sigma^2 = \frac{1}{4} \left( (z_1 - \mu)^2 + (z_2 - \mu)^2 + (z_3 - \mu)^2 + (z_4 - \mu)^2 \right)$$ $$\sigma^2 = \frac{1}{4} \left( \begin{bmatrix} -1 \ -5 \ -5 \end{bmatrix}^2 + \begin{bmatrix} 1 \ 5 \ 5 \end{bmatrix}^2 + \begin{bmatrix} 1 \ 5 \ 5 \end{bmatrix}^2 + \begin{bmatrix} -1 \ -5 \ -5 \end{bmatrix}^2 \right) = \frac{1}{4} \left( \begin{bmatrix} 1 \ 25 \ 25 \end{bmatrix} + \begin{bmatrix} 1 \ 25 \ 25 \end{bmatrix} + \begin{bmatrix} 1 \ 25 \ 25 \end{bmatrix} + \begin{bmatrix} 1 \ 25 \ 25 \end{bmatrix} \right) = \begin{bmatrix} 1 \ 25 \ 25 \end{bmatrix}$$

Step 4: Normalize Each Vector ( ${\hat{z}}_{i}$ ) Apply the normalization formula ${\hat{z}}_{i} = \frac{z_{i} - μ}{\sqrt{σ^{2}}}$ (assuming $ϵ = 0$ ) using element-wise division. The standard deviation vector is $\sqrt{σ^{2}} = [\begin{matrix} 1 5 5 \end{matrix}]$ .

${\hat{z}}_{1} = [\begin{matrix} (12 - 13) / 1 (0 - 5) / 5 (- 5 - 0) / 5 \end{matrix}] = [\begin{matrix} - 1 - 1 - 1 \end{matrix}]$
${\hat{z}}_{2} = [\begin{matrix} (14 - 13) / 1 (10 - 5) / 5 (5 - 0) / 5 \end{matrix}] = [\begin{matrix} 1 1 1 \end{matrix}]$
${\hat{z}}_{3} = [\begin{matrix} (14 - 13) / 1 (10 - 5) / 5 (5 - 0) / 5 \end{matrix}] = [\begin{matrix} 1 1 1 \end{matrix}]$
${\hat{z}}_{4} = [\begin{matrix} (12 - 13) / 1 (0 - 5) / 5 (- 5 - 0) / 5 \end{matrix}] = [\begin{matrix} - 1 - 1 - 1 \end{matrix}]$

Step 5: Reconstruct the Final Matrix ( $\hat{Z}$ ) Stack the resulting column vectors back together to form the final $3 \times 4$ normalized matrix: $$\hat{Z} = [\hat{z}_1, \hat{z}_2, \hat{z}_3, \hat{z}_4] = \left[\begin{array}{cccc} -1 & 1 & 1 & -1 \ -1 & 1 & 1 & -1 \ -1 & 1 & 1 & -1 \end{array}\right]$$

Part 2

Continue with the above setting. Suppose $γ = (1, 1, 1)$ , and $β = (0, - 10, 10)$ . What are ${\tilde{z}}_{i}$ ? Please express your answer in a $3 \times 4$ matrix.

Now that you have successfully normalized the data (giving it a mean of $0$ and a variance of $1$ ), you are ready for the second half of the Batch Normalization process: Scaling and Shifting.

Here is the step-by-step explanation of why we do this and how to solve the math.

1. The Concept: Why Scale and Shift?

In the previous question, we squashed the output of every single unit so that it perfectly centered around $0$ with a spread of $1$ . However, forcing every single layer to have the exact same rigid scale can sometimes be too restrictive and actually hurt the neural network's ability to learn complex patterns.

To fix this, the creators of Batch Normalization added a clever trick: after we normalize the data, we give the network the power to scale and shift the data into whatever range it actually needs.

$γ$ (Gamma): The learned scaling factor (it stretches or squashes the numbers).
$β$ (Beta): The learned shifting factor (it moves the numbers up or down).

If the network decides that the strict $0$ -mean and $1$ -variance was a bad idea, it can use $γ$ and $β$ to completely reverse the normalization and recover the original raw data.

2. Breaking Down the Parameters

Just like the mean and variance, the scaling ( $γ$ ) and shifting ( $β$ ) are applied independently to each dimension (unit/row).

The problem states that $γ = (1, 1, 1)$ and $β = (0, - 10, 10)$ . Because there are 3 units (rows) in our network, these vectors contain 3 numbers. Here is how they match up to our rows:

Unit 1 (Row 1): Scale by $γ_{1} = 1$ , Shift by $β_{1} = 0$
Unit 2 (Row 2): Scale by $γ_{2} = 1$ , Shift by $β_{2} = - 10$
Unit 3 (Row 3): Scale by $γ_{3} = 1$ , Shift by $β_{3} = 10$

3. Solving the Math Step-by-Step

Let's take the normalized matrix $\hat{Z}$ we calculated in the previous question: $$ \hat{Z} = \left[\begin{array}{cccc} -1 & 1 & 1 & -1 \ -1 & 1 & 1 & -1 \ -1 & 1 & 1 & -1 \end{array}\right] $$

We will now apply the formula ${\tilde{z}}_{i} = γ {\hat{z}}_{i} + β$ to each row element-by-element.

Unit 1 (Row 1):

Original normalized row: $[- 1, 1, 1, - 1]$
Multiply by $γ_{1} = 1$ : $[- 1, 1, 1, - 1] \times 1 = [- 1, 1, 1, - 1]$
Add $β_{1} = 0$ : $[- 1, 1, 1, - 1] + 0 = [- 1, 1, 1, - 1]$

Unit 2 (Row 2):

Original normalized row: $[- 1, 1, 1, - 1]$
Multiply by $γ_{2} = 1$ : $[- 1, 1, 1, - 1] \times 1 = [- 1, 1, 1, - 1]$
Add $β_{2} = - 10$ : $[- 1, 1, 1, - 1] - 10 = [- 11, - 9, - 9, - 11]$

Unit 3 (Row 3):

Original normalized row: $[- 1, 1, 1, - 1]$
Multiply by $γ_{3} = 1$ : $[- 1, 1, 1, - 1] \times 1 = [- 1, 1, 1, - 1]$
Add $β_{3} = 10$ : $[- 1, 1, 1, - 1] + 10 =$

📝 Final Answer

By stacking our newly scaled and shifted rows back together, the final output matrix $\tilde{Z}$ for the Batch Normalization layer is:

\tilde{Z} = [\begin{array}{cccc} - 1 & 1 & 1 & - 1 - 11 & - 9 & - 9 & - 11 9 & 11 & 11 & 9 \end{array}]