06 - Multi-Layer Perceptron Networks
"Now we are talking about something that is not linear"
XOR: A Limitation of the Linear Model

- Linear model: just 2 layers: input layer and output layer
- One way to make this model not linear is to make it wider (still 2 layers)
- This is called a kernel method
- How to make it wider? You make the input vector larger
- Usually we do not count the softmax layer
- The way to make it wider is to apply a nonlinear transformation $\phi$ to the input
- Now $\phi(x)$ is a wider vector
- This model will not be linear in terms of $x$
- How do we do transformations to a vector $x$? We apply $\phi$
- The transformation produces a vector whose components are products of components of $x$ with other components
- You end up increasing the dimension of the input
- In training, you only need the pairwise inner products; for the pairwise-product transformation this is $$\phi(x_i)^T\phi(x_j) = (x_i^Tx_j)^2$$
- The pairwise inner product can be computed directly from $x_i^Tx_j$, without ever forming $\phi(x)$
- There is a more efficient way to do this, which is deep learning (making the model deep)
- Kernel methods = use pairwise inner products
- Can be computed very efficiently (see the sketch after this list)
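A minimal sketch of this idea, assuming (as one concrete, illustrative choice not spelled out above) the quadratic feature map of all pairwise products; it shows that the inner product in the widened space can be computed from the original inner product alone:

```python
import numpy as np

def phi(x):
    # Quadratic feature map: all pairwise products x_a * x_b.
    # A d-dimensional input becomes a d^2-dimensional vector.
    return np.outer(x, x).ravel()

# Two hypothetical 3-dimensional inputs
xi = np.array([1.0, -2.0, 0.5])
xj = np.array([0.3, 1.0, -1.0])

explicit = phi(xi) @ phi(xj)   # inner product in the widened space
kernel   = (xi @ xj) ** 2      # same value, computed from x_i^T x_j alone

print(explicit, kernel)        # both equal (x_i^T x_j)^2
```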
Decomposing XOR

- Exclusive OR: $f = \bar{h}_1 h_2 + h_1 \bar{h}_2$, where $h_1$ and $h_2$ are linear separators (perceptrons)
- Basically means:
- If $h_1$ and $h_2$ both give you positive or both give you negative, the result is negative (-1)
- They need to be different for XOR to output positive (+1)
- So the key here is that we use linear models, but we use multiple of them and we put them together
Perceptrons for OR and AND

- OR: $\mathrm{OR}(x_1, x_2) = \mathrm{sign}(x_1 + x_2 + 1.5)$
- If one or both of $x_1$ and $x_2$ is positive, then OR is positive
- Why +1.5? It is just what is needed to represent OR
- It will just reproduce the truth table
- The inputs $x_1$ and $x_2$ are just +1 or -1
- AND: $\mathrm{AND}(x_1, x_2) = \mathrm{sign}(x_1 + x_2 - 1.5)$
- If both $x_1$ and $x_2$ are positive, then AND is positive
- (Both perceptrons are sketched right after this list.)
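A tiny sketch of these two hardwired perceptrons over ±1 inputs, using the +1.5 and -1.5 thresholds from the notes:

```python
def sign(s):
    return 1 if s > 0 else -1

def OR(x1, x2):
    # Positive whenever at least one input is +1
    return sign(x1 + x2 + 1.5)

def AND(x1, x2):
    # Positive only when both inputs are +1
    return sign(x1 + x2 - 1.5)

# Reproduce the truth tables
for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, "OR:", OR(x1, x2), "AND:", AND(x1, x2))
```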
Representing XOR Using OR and AND

- Here we work backward because we are given $f = \bar{h}_1 h_2 + h_1 \bar{h}_2$
- Here, $x$ is the input and the output is $f$
- We need to expand each of the pieces of $f$
- The first expansion writes $f$ as an OR of the two terms: $f = \mathrm{OR}(h_1\bar{h}_2,\ \bar{h}_1 h_2)$
- The next step is to expand each of the two terms as an AND, e.g. $h_1\bar{h}_2 = \mathrm{AND}(h_1, \bar{h}_2)$
- Note that each of $h_1$ and $h_2$ is a linear model of the input
- The -1.5 will eliminate the term we added to implement OR/AND
- Note the bar on the top means NOT (negation); with ±1 values this is just a sign flip
- Now, since we know each of $h_1$ and $h_2$ is a linear model of the inputs $x_1$ and $x_2$, we can decompose even further
- Note how this became a network with multiple layers!
- Nothing is trained here
- This is just a completely hardwired network
- It solves problems that linear models cannot solve on their own
- Will this method enable us to do more complex models (something more complex than XOR)?
- Yes! You can always convert any logic function into this form
- As long as you can convert it to this standard form, you will be able to use a network just like this
- The logic form will just look longer
- (A hardwired XOR network is sketched after this list.)
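A minimal sketch of the hardwired network (nothing is trained). For simplicity it feeds $h_1$ and $h_2$ directly as ±1 values; in the lecture's picture each $h_i$ would itself be a perceptron of $(x_1, x_2)$:

```python
def sign(s):
    return 1 if s > 0 else -1

def OR(a, b):
    return sign(a + b + 1.5)

def AND(a, b):
    return sign(a + b - 1.5)

def XOR(h1, h2):
    # f = OR( AND(h1, NOT h2), AND(NOT h1, h2) )
    # With +/-1 values, NOT is just a sign flip.
    return OR(AND(h1, -h2), AND(-h1, h2))

# Truth table: positive only when h1 and h2 differ
for h1 in (-1, 1):
    for h2 in (-1, 1):
        print(h1, h2, XOR(h1, h2))
```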
The Multilayer Perceptron (MLP)

- More layers allow us to implement functions (like XOR) that a single linear model cannot
- These additional layers are called hidden layers
Universal Approximation
- Any target function $f$ that can be decomposed into linear separators can be implemented by a 3-layer MLP.
- A sufficiently smooth separator can "essentially" be decomposed into linear separators (see the sketch at the end of this subsection).

Theory:
- More units/More linear models -> more accurate predictions
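A small sketch of that claim, using an assumed example: the smooth separator $x_1^2 + x_2^2 < 1$ (inside a circle) is approximated by AND-ing 8 linear separators tangent to the circle; more separators give a better approximation:

```python
import numpy as np

def sign(s):
    return 1.0 if s > 0 else -1.0

# 8 half-planes tangent to the unit circle, one per direction
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
normals = [(np.cos(a), np.sin(a)) for a in angles]

def AND_n(values):
    # AND of +/-1 values as a single perceptron: +1 only if all inputs are +1
    return sign(sum(values) - (len(values) - 1))

def inside_octagon(x1, x2):
    # Each linear separator votes +1 if the point is on the circle's side of its tangent line
    votes = [sign(1.0 - (n1 * x1 + n2 * x2)) for n1, n2 in normals]
    return AND_n(votes)

print(inside_octagon(0.0, 0.0))   # +1: inside the circle and the octagon
print(inside_octagon(0.9, 0.0))   # +1: still inside
print(inside_octagon(2.0, 0.0))   # -1: outside
```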
The Neural Network
- The network's output is not smooth (due to the sign function), so we cannot use gradient descent to minimize the training error
- To use gradient descent, the activation function $\theta$ needs to be differentiable
- This soft activation has been used for many years; only recently did we change it, but we will talk about that later (a small sketch follows this list)
- Note: we use the training data to determine all the weights here
- The question is: how do we obtain those connection weights?
- If I have these values on the left, how do we compute the values on the right (on the next layer)?
- The "1" on each layer is essentially the threshold (bias) term
- That is why this unit has no input, but outputs a fixed value
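A small sketch of the smoothness point, using tanh as one common differentiable stand-in for the hard threshold (the specific function is an assumption here, not named in the notes above):

```python
import numpy as np

s = np.linspace(-3, 3, 7)

hard = np.sign(s)      # hard threshold: flat almost everywhere, gradient 0 or undefined
soft = np.tanh(s)      # smooth "soft threshold": differentiable everywhere
grad = 1 - soft ** 2   # d/ds tanh(s), usable by gradient descent

print(hard)
print(soft)
print(grad)
```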
Zooming into a Hidden Node

[Figure: network diagram with the input layer, hidden layers, and output layer]
Notes:
- What are the operations to move between layers?
- For each layer we have a signal that goes into that layer
- For each layer we want to put the inputs of all its units into a vector; this vector is the signal vector $s$
- Note that $s$ does not include the bias term
- Summary: each layer has an input vector and an output vector
- At the end we will build a weight matrix for each layer (connecting it to the next one)
- What would be the size of this matrix?
- This will be on the exam!
The Linear Signal
[Figure: the input signal entering a unit]
The question:
- If I have the input for one layer, how do we calculate the output?
- What is the operation? This is where $\theta$ comes in
- We need to take the signal vector and apply $\theta$ to every element (element-wise)
- Now, if I know the output values for this layer, how do we calculate the input values (signals) of the next layer?
- If you look at the graph, the input of every unit depends on the outputs of the previous layer, but how do we compute this?
- What else do we need? -> the weights
- We need to take one column from the weight matrix: the signal of a unit is the inner product of the previous layer's outputs with that unit's column of weights
- Remember: the weight matrix has one row per unit of the previous layer (including its bias "1") and one column per unit of the next layer (the next layer's bias receives no input)
- For example, between the second layer (1, $x_1$, $x_2$, $x_3$) and the third layer (1, $x_1$, $x_2$) in the example above, we will have a 4 x 2 matrix
- In order to multiply the weights by the output vector, you need to transpose the matrix to account for the different number of rows and columns of each vector/matrix (see the sketch below)
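A minimal forward-pass sketch under these conventions (layer sizes chosen to match the 4 x 2 example above; the weight values and the tanh activation are made-up choices for illustration):

```python
import numpy as np

def theta(s):
    # Smooth element-wise activation (tanh used here as one common choice)
    return np.tanh(s)

# Output of the second layer, including the constant bias unit "1"
x_prev = np.array([1.0, 0.5, -1.0, 0.25])          # (1, x1, x2, x3)

# Weight matrix between the second and third layers:
# 4 rows = units of the previous layer (including its bias),
# 2 columns = units of the next layer (its bias "1" receives no input).
W = np.array([
    [ 0.1, -0.3],
    [ 0.8,  0.2],
    [-0.5,  0.7],
    [ 0.4, -0.6],
])

# Signal of the next layer: each column of W dotted with x_prev
s_next = W.T @ x_prev                               # shape (2,)

# Output of the next layer: apply theta element-wise, then prepend the bias "1"
x_next = np.concatenate(([1.0], theta(s_next)))     # (1, x1, x2)

print(s_next, x_next)
```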