06 - Multi-Layer Perceptron Networks

Class: CSCE-421


Notes:

"Now we are talking about something that is not linear"

XOR: A Limitation of the Linear Model

The Math

![[Pasted image 20260210094215.png|300]]

The Meaning

1. The Problem: The XOR Limitation. Up until now, you have been using simple 2-layer linear models (an input layer $x$ directly connected to an output layer). These models act by drawing a single straight line (or flat plane) to separate the $+1$ class from the $-1$ class.

However, imagine the Exclusive OR (XOR) problem. In a 2D graph, suppose the top-left and bottom-right corners are $+1$, but the top-right and bottom-left corners are $-1$. Try drawing a single straight line to separate the positives from the negatives. You can't. A single linear model is mathematically incapable of solving XOR.

To solve non-linear problems like XOR, we have two main choices: make the model "wider" or make it "deeper".

2. Solution 1: Making it "Wider" (Feature Transforms & Kernel Methods). We already explored making models "wider" in the previous chapter on Polynomial Curve Fitting.

3. Solution 2: Making it "Deeper" (Deep Learning). Instead of relying on complex math tricks to expand $x$, we can solve the problem by combining multiple simple linear models together. This is the foundation of Deep Learning.

Decomposing XOR

The Math

![[Pasted image 20260210095557.png|350]]

The Meaning

Decomposing XOR. These notes show how multiple linear models can solve XOR using the logic formula $f = \bar{h}_1 h_2 + h_1 \bar{h}_2$. Here is the plain-English breakdown of what this means:

By training two simple linear lines (h1 and h2) and feeding their answers into a new final layer that checks if they match, you have successfully built a multi-layer model that easily solves XOR.

Perceptrons for OR and AND

The Math
$$\text{OR}(x_1, x_2) = \text{sign}(x_1 + x_2 + 1.5)$$
$$\text{AND}(x_1, x_2) = \text{sign}(x_1 + x_2 - 1.5)$$

![[Pasted image 20260210100015.png|600]]

The Meaning

Logic Gates as Simple Perceptrons. In the previous slide, we learned that a single linear model cannot solve the XOR problem. However, a single linear model can easily solve basic logic gates like OR and AND. Let's look at how the math works when our inputs $(x_1, x_2)$ are strictly $+1$ (True) or $-1$ (False). The slide uses a clever trick with the bias (threshold) weight to switch between OR and AND.
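These two thresholds can be verified with a minimal sketch (the function names `OR_gate` and `AND_gate` are mine, not from the slides):

```python
def sign(s):
    # sign activation: +1 if s >= 0, else -1
    return 1 if s >= 0 else -1

def OR_gate(x1, x2):
    # bias +1.5: the sum only drops below -1.5 when BOTH inputs are -1
    return sign(x1 + x2 + 1.5)

def AND_gate(x1, x2):
    # bias -1.5: the sum only exceeds +1.5 when BOTH inputs are +1
    return sign(x1 + x2 - 1.5)

for a in (-1, 1):
    for b in (-1, 1):
        print(a, b, "OR:", OR_gate(a, b), "AND:", AND_gate(a, b))
```

Walking the truth table confirms that the single bias term is the only difference between the two gates.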

Representing f Using OR and AND

The Math

![[Pasted image 20260210100323.png|250]]

![[Pasted image 20260210100636.png|400]]

![[Pasted image 20260210100959.png|500]]

The Meaning

1. Representing f (XOR) by Stacking Perceptrons. To solve the XOR problem ($f = \bar{h}_1 h_2 + h_1 \bar{h}_2$), we cannot use just one perceptron. But, because XOR is just a combination of ANDs, ORs, and NOTs, we can wire multiple perceptrons together.

2. Hardwiring vs. Learning. Your notes highlight a very important point: "Nothing is trained here. This is just a completely hardwired network". At this stage, we have manually calculated exactly what the weights and biases must be (like the $+1.5$ and $-1.5$) to solve XOR. We act as the "designer". In reality, when we build Neural Networks, we don't know what logic gates we need. We just set up the blank architecture (the nodes and connections) and let the Gradient Descent algorithm figure out all the weights and biases on its own! This manual example simply exists to prove mathematically that a multi-layer model is capable of solving a non-linear problem.
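To confirm the hardwired weights really do compute XOR, here is a sketch of the full three-perceptron network in the ±1 coding (function names are mine; NOT is just a sign flip):

```python
def sign(s):
    return 1 if s >= 0 else -1

def AND(x1, x2):
    return sign(x1 + x2 - 1.5)   # hardwired bias -1.5

def OR(x1, x2):
    return sign(x1 + x2 + 1.5)   # hardwired bias +1.5

def XOR(h1, h2):
    # f = (NOT h1 AND h2) OR (h1 AND NOT h2); in +/-1 coding, NOT h = -h
    return OR(AND(-h1, h2), AND(h1, -h2))

for a in (-1, 1):
    for b in (-1, 1):
        print(a, b, XOR(a, b))   # +1 exactly when the inputs differ
```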

The Multilayer Perceptron (MLP)

The Math

![[Pasted image 20260210101820.png|500]]

The Meaning

The Multilayer Perceptron (MLP) and Universal Approximation: By feeding the output of one linear model into the input of another, you have created a Multilayer Perceptron (MLP).

Universal Approximation

![[Pasted image 20260210101948.png|500]]

Theory:

The Neural Network

The Math

![[Pasted image 20260210102059.png|675]]

The Meaning

1. Why tanh(x)? The activation $\theta(s) = \tanh(s)$ acts as a soft, smooth version of $\operatorname{sign}(s)$: it saturates at $\pm 1$ for large $|s|$, but unlike sign it is differentiable everywhere, which is exactly what gradient descent needs.

2. The Bias Units (The "1" Nodes). Your notes ask about the "1" on each layer. This is the exact same concept as the dummy coordinate $x_0 = 1$ we used in Linear Regression and the Perceptron.

Zooming into a Hidden Node

The Math

![[Pasted image 20260210102405.png|400]]

layer parameters (notations)

signals in: $s^{(\ell)}$, a $d^{(\ell)}$-dimensional input vector
outputs: $x^{(\ell)}$, a $(d^{(\ell)}+1)$-dimensional output vector
weights in: $W^{(\ell)}$, a $(d^{(\ell-1)}+1) \times d^{(\ell)}$ dimensional matrix
weights out: $W^{(\ell+1)}$, a $(d^{(\ell)}+1) \times d^{(\ell+1)}$ dimensional matrix

layers $\ell = 0, 1, 2, \ldots, L$
layer $\ell$ has "dimension" $d^{(\ell)}$, i.e. $d^{(\ell)}+1$ nodes (counting the bias node)

$$W^{(\ell)} = \left[\, w_1^{(\ell)} \;\; w_2^{(\ell)} \;\; \cdots \;\; w_{d^{(\ell)}}^{(\ell)} \,\right], \qquad W = \{W^{(1)}, W^{(2)}, \ldots, W^{(L)}\} \text{ specifies the network}$$

Notes:

The Meaning

1. Zooming into a Node (The Pipeline). To understand Neural Network math, you must understand the two-step pipeline that happens at every single layer $\ell$.

2. Cracking the Exam Question: The Weight Matrix Dimensions. Your notes emphasize that the size of the weight matrix $W^{(\ell)}$ will be on the exam. Let's break down why it has the dimensions $(d^{(\ell-1)}+1) \times d^{(\ell)}$ so you don't have to just memorize it.

The matrix $W^{(\ell)}$ is the bridge connecting layer $\ell-1$ to layer $\ell$.
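A quick shape check in numpy makes the dimensions concrete; the architecture here ($d^{(0)}=2$, $d^{(1)}=3$, $d^{(2)}=1$) is a made-up example:

```python
import numpy as np

d = [2, 3, 1]   # hypothetical layer dimensions d(0), d(1), d(2)

# W(l) has shape (d(l-1) + 1) x d(l): the extra row multiplies the bias node
W = [np.zeros((d[l - 1] + 1, d[l])) for l in range(1, len(d))]

print([w.shape for w in W])   # [(3, 3), (4, 1)]
```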

The Linear Signal

The Math

Input $s^{(\ell)}$ is a linear combination (using weights) of the outputs of the previous layer $x^{(\ell-1)}$.

$$s^{(\ell)} = (W^{(\ell)})^{T} x^{(\ell-1)}$$

![[Pasted image 20260210102059.png|400]]

![[Pasted image 20260210104122.png|300]]

$$s_j^{(\ell)} = (w_j^{(\ell)})^{T} x^{(\ell-1)} \quad \text{(recall the linear signal } s = w^{T} x\text{)}, \qquad s^{(\ell)} \xrightarrow{\;\theta\;} x^{(\ell)}$$

The question:

The Meaning

1. The Two-Step Process of a Neural Network Layer. Your notes ask an excellent question: "Given the input of a layer, how do we compute the input for the next layer?" It is crucial to clarify the terminology here. Moving data through a Neural Network happens in a repeating two-step cycle for every layer $\ell$:

2. Step A: Calculating the Raw Signal ($s^{(\ell)}$). The mathematical formula for the first step is: $$s^{(\ell)} = (W^{(\ell)})^{T} x^{(\ell-1)}$$

3. Step B: Calculating the Output ($x^{(\ell)}$). Your notes state: "The operation is simple, you just apply θ. Take that vector, apply θ element-wise and add a 1." This is absolutely correct and forms the second half of the pipeline.

4. Exam Prep: Counting Parameters You noted that the exam will ask for the number of parameters. In a neural network, the "parameters" are simply the total number of weights (the individual elements inside the W matrices).
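Counting parameters then reduces to summing the matrix sizes; a sketch using the same made-up [2, 3, 1] architecture:

```python
# each layer l contributes (d(l-1) + 1) * d(l) weights (the +1 is the bias row)
def count_parameters(dims):
    return sum((dims[l - 1] + 1) * dims[l] for l in range(1, len(dims)))

print(count_parameters([2, 3, 1]))   # (2+1)*3 + (3+1)*1 = 13
```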

Forward Propagation: Computing h(x)

$$x = x^{(0)} \xrightarrow{\,W^{(1)}\,} s^{(1)} \xrightarrow{\,\theta\,} x^{(1)} \xrightarrow{\,W^{(2)}\,} s^{(2)} \xrightarrow{\,\theta\,} x^{(2)} \;\cdots\; \xrightarrow{\,W^{(L)}\,} s^{(L)} \xrightarrow{\,\theta\,} x^{(L)} = h(x)$$

Forward propagation to compute h(x):

  1. $x^{(0)} \leftarrow x$
    • [Initialization]
  2. for $\ell = 1$ to $L$ do
    • [Forward Propagation]
  3. $s^{(\ell)} \leftarrow (W^{(\ell)})^{T} x^{(\ell-1)}$
  4. $x^{(\ell)} \leftarrow \begin{bmatrix} 1 \\ \theta(s^{(\ell)}) \end{bmatrix}$
  5. end for
  6. $h(x) = x^{(L)}$
    • [Output]
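The pseudocode translates almost line for line into numpy. This is a sketch, assuming W is a list of matrices [W(1), ..., W(L)] shaped as in the previous section, and that θ is applied at every layer including the last (as the pseudocode does):

```python
import numpy as np

def forward(x, W):
    """Forward propagation: x is the raw input vector (no bias entry),
    W is a list [W(1), ..., W(L)] of weight matrices."""
    x_l = np.concatenate(([1.0], x))                  # x(0): prepend the bias node
    for W_l in W:
        s_l = W_l.T @ x_l                             # s(l) = (W(l))^T x(l-1)
        x_l = np.concatenate(([1.0], np.tanh(s_l)))   # x(l) = [1, theta(s(l))]
    return x_l[1:]                                    # h(x): drop the final bias entry

# tiny example: a 2 -> 2 -> 1 network with arbitrary weights
W = [np.full((3, 2), 0.1), np.full((3, 1), 0.1)]
print(forward(np.array([1.0, -1.0]), W))
```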

Notes:

Minimizing Ein

The Math
$$E_{\text{in}}(h) = E_{\text{in}}(W) = \frac{1}{N}\sum_{n=1}^{N}\left(h(x_n) - y_n\right)^2, \qquad W = \{W^{(1)}, W^{(2)}, \ldots, W^{(L)}\}$$

![[Pasted image 20260211234601.png]]

Using θ=tanh makes Ein  differentiable so we can use gradient descent ⟶ local minimum.

Notes:

The Meaning

1. The Objective: Minimizing the Error ($E_{\text{in}}$). Just like in linear regression, the goal of training a neural network is to find the weights that make our predictions $h(x_n)$ as close to the true labels $y_n$ as possible. The slide uses the standard squared error function: $$E_{\text{in}}(W)=\frac{1}{N} \sum_{n=1}^N\left(h(x_n)-y_n\right)^2$$

2. Correcting your note: The Training Cycle In your notes, you wrote: "A backward propagation is to adjust weights after a training process and then you need to do forward propagation." It is important to get the order of operations exactly right for your exam. One single step of training actually happens in this exact order:

  1. Forward Propagation: Pass a data point x through the network to get a prediction h(x), and calculate how wrong it is (the error).
  2. Backward Propagation (Backprop): Use the Chain Rule to calculate the gradient (the slope) of that error with respect to every single weight in the network.
  3. Weight Update (Gradient Descent): Adjust the weights by taking a small step in the negative gradient direction.

Gradient Descent

The Math
$$W(t+1) = W(t) - \eta\, \nabla E_{\text{in}}(W(t))$$

Notes:

The Meaning

The Gradient Descent Update Rule. Because a neural network has multiple layers, the "weight" parameter $W$ is no longer just one vector. It is a massive collection of several matrices $\{W^{(1)}, W^{(2)}, \ldots, W^{(L)}\}$, one for each layer. To update all these matrices at once, we use the standard Gradient Descent rule: $$W(t+1)=W(t)-\eta \nabla E_{\mathrm{in}}(W(t))$$

Gradient of Ein

The Math
$$E_{\text{in}}(W) = \frac{1}{N}\sum_{n=1}^{N} e_n\!\left(h(x_n), y_n\right), \qquad \frac{\partial E_{\text{in}}(W)}{\partial W^{(\ell)}} = \frac{1}{N}\sum_{n=1}^{N} \frac{\partial e_n}{\partial W^{(\ell)}}$$

We need

$$\frac{\partial e(x)}{\partial W^{(\ell)}}$$

Notes:

The Meaning

Divide and Conquer: The Gradient of One Sample How do we calculate the derivative for a massive network over a dataset of thousands of images? The slide shows a brilliant mathematical shortcut: Divide and Conquer.

The total error $E_{\text{in}}$ is just the average of the individual errors $e_n$ for every single data point. Because the derivative of a sum is just the sum of the derivatives, we can pull the derivative operator inside the summation: $$\frac{\partial E_{\mathrm{in}}(W)}{\partial W^{(\ell)}} = \frac{1}{N} \sum_{n=1}^N \frac{\partial e_n}{\partial W^{(\ell)}}$$ What does this mean in plain English? It means we do not need to figure out the calculus for the entire dataset at once. We only need to figure out how to calculate the gradient for one single data point, $\frac{\partial e}{\partial W^{(\ell)}}$. Once we use the Chain Rule to find the derivative for one point, we simply run a loop: calculate the derivative for point 1, point 2, point 3... and then average them all together to get the true gradient for the whole dataset.
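To see the averaging trick concretely, here is a sketch using a plain linear model $h(x) = w^T x$ with squared error as a stand-in for the full network (the data and sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))     # 5 toy data points, 3 features
y = rng.normal(size=5)
w = rng.normal(size=3)

# per-sample gradient: d/dw of (w.x_n - y_n)^2 is 2 (w.x_n - y_n) x_n
per_sample = [2 * (w @ X[n] - y[n]) * X[n] for n in range(len(y))]
grad_avg = np.mean(per_sample, axis=0)

# gradient of E_in computed on the whole dataset at once
grad_full = 2 * X.T @ (X @ w - y) / len(y)

print(np.allclose(grad_avg, grad_full))   # True: averaging per-point gradients
```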

Algorithmic Approach

The Math

$e(x)$ is a function of $s^{(\ell)}$, and $s^{(\ell)} = (W^{(\ell)})^{T} x^{(\ell-1)}$

$$\frac{\partial e}{\partial W^{(\ell)}} = \frac{\partial s^{(\ell)}}{\partial W^{(\ell)}}\left(\frac{\partial e}{\partial s^{(\ell)}}\right)^{T} \;\text{(chain rule)}\; = x^{(\ell-1)}\left(\delta^{(\ell)}\right)^{T}$$

sensitivity

$$\delta^{(\ell)} = \frac{\partial e}{\partial s^{(\ell)}}$$

Notes:


$$\frac{\partial e}{\partial W^{(\ell)}} = x^{(\ell-1)}\left(\delta^{(\ell)}\right)^{T}$$

![[Pasted image 20260211234850.png|350]]

Notes:

The Meaning

1. The Goal: Finding the Gradient. To train the neural network, we need to know how to adjust every single weight matrix $W^{(\ell)}$ to make the final error $e$ smaller. Mathematically, we need to find $\frac{\partial e}{\partial W^{(\ell)}}$. However, the weights $W^{(\ell)}$ do not directly calculate the error. The weights calculate the raw signal $s^{(\ell)}$, which goes through an activation function to become $x^{(\ell)}$, which eventually goes to the output layer to calculate the error. Because of this chain of events, we must use the Chain Rule from calculus.

2. The Chain Rule Split The slide splits the impossible derivative into two manageable pieces multiplied together: $$ \frac{\partial \mathrm{e}}{\partial \mathrm{W}^{(\ell)}} = \frac{\partial \mathrm{s}^{(\ell)}}{\partial \mathrm{W}^{(\ell)}} \cdot \left(\frac{\partial \mathrm{e}}{\partial \mathrm{s}^{(\ell)}}\right)^T $$

3. What is Sensitivity ($\delta$)? Your notes on this are excellent! You correctly point out that sensitivity measures exactly how much a change in the raw input signal $s^{(\ell)}$ affects the final error $e$.

4. Putting it Together (The Matrix Math). When we substitute $x^{(\ell-1)}$ and $\delta^{(\ell)}$ back into the Chain Rule equation, we get the final brilliant formula for the gradient: $$ \frac{\partial \mathrm{e}}{\partial \mathrm{W}^{(\ell)}} = \mathrm{x}^{(\ell-1)}\left(\boldsymbol{\delta}^{(\ell)}\right)^{\mathrm{T}} $$ Let's look at why this makes perfect sense, both logically and mathematically:

5. The Formula: Slope = Input × Sensitivity. You are looking at the final formula from the Chain Rule: $\frac{\partial e}{\partial W^{(\ell)}} = x^{(\ell-1)}(\delta^{(\ell)})^{T}$. Let's translate this matrix math into plain English for a single weight connecting Node i to Node j:
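The outer-product form produces the entire gradient matrix in one multiplication; a shape sketch with made-up numbers:

```python
import numpy as np

x_prev = np.array([1.0, 0.5, -2.0, 0.25])   # x(l-1), bias entry first (d(l-1) = 3)
delta = np.array([0.5, -0.25])              # delta(l), one entry per node (d(l) = 2)

# gradient of e w.r.t. W(l): the outer product x(l-1) (delta(l))^T
grad = np.outer(x_prev, delta)

print(grad.shape)        # (4, 2): exactly the shape of W(l)
print(grad[0])           # entry (i, j) is input_i * sensitivity_j
```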

6. Where do we get these numbers? As your notes point out, you get these two pieces of the puzzle at two different times:

7. CRITICAL EXAM CONCEPT: The "All-Zeros" Initialization Your notes bring up a classic, guaranteed exam question: "What happens if we initialize W to be all zeroes?"

Your notes answer this by saying that if W=0, the input to the activation function is 0. If you are using tanh, then tanh(0)=0, meaning the output x of all your hidden nodes becomes 0, paralyzing the network. This is completely true! However, there is an even bigger, more fundamental reason why this fails, which professors specifically look for: The Symmetry Breaking Problem.

Imagine you use the Logistic Sigmoid function instead, where θ(0)=0.5. The outputs are no longer zero. Does initializing with all zeros work now? No!

If you have 1,000 hidden nodes, but they all have the exact same weights and do the exact same math, you don't actually have 1,000 nodes—you effectively just have 1 node. The network will never be able to learn different, complex features (like one node looking for edges, and another looking for colors). Therefore, you must initialize weights with small, random numbers to "break the symmetry" so that different nodes can learn different things!
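The symmetry problem is easy to demonstrate: with identical (here all-zero) weights, every hidden node produces the same output, and by the outer-product gradient formula every node's weight column receives the identical update, so the clones never diverge. A minimal sketch (sigmoid activation, made-up sizes):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

x = np.array([1.0, 0.5, -0.3])    # input vector with bias entry x0 = 1

W1 = np.zeros((3, 4))             # all-zero weights into 4 hidden nodes
h = sigmoid(W1.T @ x)
print(h)                          # [0.5 0.5 0.5 0.5]: every node is identical

# because the hidden outputs match, the sensitivities match too, so every
# column of the gradient x (delta)^T is the same: the nodes stay clones forever
delta = np.full(4, 0.1)           # identical sensitivities (forced by symmetry)
grad = np.outer(x, delta)
print(np.allclose(grad[:, 0], grad[:, 1]))   # True
```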

Computing δ() Using the Chain Rule

The Math
$$\delta^{(1)} \longleftarrow \delta^{(2)} \longleftarrow \cdots \longleftarrow \delta^{(L-1)} \longleftarrow \delta^{(L)}$$

Multiple applications of the chain rule:

$$\Delta s^{(\ell)} \xrightarrow{\,\theta\,} \Delta x^{(\ell)} \xrightarrow{\,W^{(\ell+1)}\,} \Delta s^{(\ell+1)} \longrightarrow \cdots \longrightarrow \Delta e(x)$$

![[Pasted image 20260211235009.png|700]]

Notes:

$$\delta^{(\ell)} = \theta'\!\left(s^{(\ell)}\right) \otimes \left[W^{(\ell+1)} \delta^{(\ell+1)}\right]_{1}^{d^{(\ell)}}$$

![[Pasted image 20260211235051.png|350]]

Notes:

The Meaning

1. The Big Picture: Passing the Blame Backwards. In the previous slide, we learned that to update the weights, we need to know the Sensitivity ($\delta^{(\ell)}$) of every single layer.

2. Breaking Down the Backprop Formula. Your slide gives the master formula for how to compute the sensitivity of the current layer ($\ell$) if you already know the sensitivity of the next layer ($\ell+1$): $$ \boldsymbol{\delta}^{(\ell)}=\theta^{\prime}\left(\mathbf{s}^{(\ell)}\right) \otimes\left[\mathrm{W}^{(\ell+1)} \boldsymbol{\delta}^{(\ell+1)}\right]_1^{d^{(\ell)}} $$ This formula executes the two "backward" steps required to pass through a layer:

3. The "Curve Gradient" and the Vanishing Gradient Problem You wrote a brilliant note here: "If your input value to θ is very large, your gradient will be close to 0."

Let's think about why this is mathematically true and practically dangerous:
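For $\theta = \tanh$, the derivative is $\theta'(s) = 1 - \tanh^2(s)$, and a few sample values show how fast it collapses:

```python
import numpy as np

def tanh_prime(s):
    # derivative of tanh: 1 - tanh(s)^2
    return 1.0 - np.tanh(s) ** 2

for s in [0.0, 2.0, 5.0, 10.0]:
    print(s, tanh_prime(s))
# the slope is 1 at s = 0 but roughly 1.8e-4 at s = 5 and 8e-9 at s = 10;
# a delta multiplied by factors like these across many layers vanishes
```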

The Backpropagation Algorithm

$$\delta^{(1)} \longleftarrow \delta^{(2)} \longleftarrow \cdots \longleftarrow \delta^{(L-1)} \longleftarrow \delta^{(L)}$$

![[Pasted image 20260211235303.png|500]]

Algorithm for Gradient Descent on Ein

![[Pasted image 20260211235351.png|500]]
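Putting forward propagation, the sensitivity recursion, and the outer-product gradient together, here is a compact sketch of one gradient step on a single point (squared error, tanh at every layer as in the slides; the network size and data are made up):

```python
import numpy as np

def forward_backward(x, y, W):
    """One sample: forward pass, then backpropagate the sensitivities.
    Returns the per-sample gradients dE/dW(l), one matrix per layer."""
    xs = [np.concatenate(([1.0], x))]                  # x(0) with bias node
    ss = []
    for W_l in W:                                      # forward propagation
        ss.append(W_l.T @ xs[-1])                      # s(l) = (W(l))^T x(l-1)
        xs.append(np.concatenate(([1.0], np.tanh(ss[-1]))))
    h = xs[-1][1:]                                     # h(x), bias dropped
    delta = 2 * (h - y) * (1 - np.tanh(ss[-1]) ** 2)   # delta(L) for squared error
    grads = [None] * len(W)
    for l in range(len(W) - 1, -1, -1):                # backward propagation
        grads[l] = np.outer(xs[l], delta)              # x(l-1) (delta(l))^T
        if l > 0:                                      # pass delta back; drop bias row
            delta = (1 - np.tanh(ss[l - 1]) ** 2) * (W[l] @ delta)[1:]
    return grads

# toy run: 2 -> 2 -> 1 network, one gradient-descent update on one point
rng = np.random.default_rng(1)
W = [rng.normal(scale=0.1, size=(3, 2)), rng.normal(scale=0.1, size=(3, 1))]
grads = forward_backward(np.array([1.0, -1.0]), np.array([1.0]), W)
eta = 0.1
W = [W_l - eta * g for W_l, g in zip(W, grads)]        # W(t+1) = W(t) - eta * grad
print([g.shape for g in grads])   # [(3, 2), (3, 1)]: same shapes as the W(l)
```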

Digits Data

![[Pasted image 20260211235448.png]]