06 - Multi-Layer Perceptron Networks

"Now we are talking about something that is not linear"

XOR: A Limitation of the Linear Model

Pasted image 20260210094215.png|300

Decomposing XOR

Pasted image 20260210095557.png|350

Perceptrons for OR and AND

Pasted image 20260210100015.png|600

Representing f Using OR and AND

Pasted image 20260210100323.png|250

Pasted image 20260210100636.png|400

Pasted image 20260210100959.png|500
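A small numpy sketch of this decomposition: XOR is built as an OR of two ANDs, each realized by a single perceptron. The ±1 encoding, the sign activation, and the particular weight values are choices made for this illustration (any weights realizing the same decision boundaries work).

```python
import numpy as np

def perceptron(w, x):
    """sign(w^T [1, x]) with a bias coordinate prepended to x."""
    return np.sign(w @ np.concatenate(([1.0], x)))

# Hand-picked weights for +/-1 inputs (illustrative values):
w_or  = np.array([ 1.5, 1.0, 1.0])   # OR(x1, x2)  = sign( 1.5 + x1 + x2)
w_and = np.array([-1.5, 1.0, 1.0])   # AND(x1, x2) = sign(-1.5 + x1 + x2)

def xor(x1, x2):
    # XOR = (x1 AND NOT x2) OR (NOT x1 AND x2); with +/-1 encoding, NOT is a sign flip.
    h1 = perceptron(w_and, np.array([ x1, -x2]))
    h2 = perceptron(w_and, np.array([-x1,  x2]))
    return perceptron(w_or, np.array([h1, h2]))

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, "->", xor(x1, x2))
```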

The Multilayer Perceptron (MLP)

Pasted image 20260210101820.png|500

Universal Approximation

Pasted image 20260210101948.png|500

Theory: with enough hidden units, an MLP can approximate any continuous target function on a bounded region to arbitrary accuracy (universal approximation).

The Neural Network

Pasted image 20260210102059.png|325

Zooming into a Hidden Node

Pasted image 20260210102405.png|400

Layer Parameters (Notation)

- signals in: $s^{(\ell)}$, a $d^{(\ell)}$-dimensional input vector
- outputs: $x^{(\ell)}$, a $(d^{(\ell)}+1)$-dimensional output vector
- weights in: $W^{(\ell)}$, a $(d^{(\ell-1)}+1) \times d^{(\ell)}$-dimensional matrix
- weights out: $W^{(\ell+1)}$, a $(d^{(\ell)}+1) \times d^{(\ell+1)}$-dimensional matrix

layers $\ell = 0, 1, 2, \ldots, L$
layer $\ell$ has "dimension" $d^{(\ell)}$ $\Longrightarrow$ $d^{(\ell)}+1$ nodes

$$W^{(\ell)} = \begin{bmatrix} w_1^{(\ell)} & w_2^{(\ell)} & \cdots & w_{d^{(\ell)}}^{(\ell)} \end{bmatrix}, \qquad W = \{W^{(1)}, W^{(2)}, \ldots, W^{(L)}\} \text{ specifies the network}$$
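As a quick shape check (a sketch; the layer dimensions below are arbitrary):

```python
import numpy as np

# Hypothetical layer dimensions d^(0), ..., d^(L); here L = 2.
d = [2, 3, 1]          # d[0] = input dimension, d[L] = output dimension
L = len(d) - 1

# W[l] has shape (d^(l-1)+1) x d^(l): one row per input of the layer (plus bias), one column per node.
W = {l: np.random.randn(d[l - 1] + 1, d[l]) * 0.1 for l in range(1, L + 1)}

for l in range(1, L + 1):
    print(f"W({l}): {W[l].shape}")   # (3, 3) and (4, 1) for d = [2, 3, 1]
```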

Notes:

The Linear Signal

Input $s^{(\ell)}$ is a linear combination (using the weights) of the outputs of the previous layer, $x^{(\ell-1)}$:

$$s^{(\ell)} = \left(W^{(\ell)}\right)^{\mathrm{T}} x^{(\ell-1)}$$

Pasted image 20260210102059.png|400

Pasted image 20260210104122.png|300

$$s_j^{(\ell)} = \left(w_j^{(\ell)}\right)^{\mathrm{T}} x^{(\ell-1)} \quad \text{(recall the linear signal } s = w^{\mathrm{T}} x\text{)}, \qquad s^{(\ell)} \xrightarrow{\ \theta\ } x^{(\ell)}$$
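In code, one layer's signal is a single matrix-vector product (shapes as in the notation above; the variable names are illustrative):

```python
import numpy as np

d_prev, d_curr = 3, 2                                       # hypothetical widths d^(l-1), d^(l)
x_prev = np.concatenate(([1.0], np.random.randn(d_prev)))   # x^(l-1), with the bias 1 prepended
W_l = np.random.randn(d_prev + 1, d_curr)                   # W^(l): (d^(l-1)+1) x d^(l)

s_l = W_l.T @ x_prev                                        # s^(l) = (W^(l))^T x^(l-1)
print(s_l.shape)                                            # (2,) -- one signal per node in layer l
```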

The question: how do we compute $h(x)$ for a given input $x$?

Forward Propagation: Computing h(x)

$$x = x^{(0)} \xrightarrow{\ W^{(1)}\ } s^{(1)} \xrightarrow{\ \theta\ } x^{(1)} \xrightarrow{\ W^{(2)}\ } s^{(2)} \xrightarrow{\ \theta\ } x^{(2)}\ \cdots\ \xrightarrow{\ W^{(L)}\ } s^{(L)} \xrightarrow{\ \theta\ } x^{(L)} = h(x).$$

Forward propagation to compute h(x):

1. $x^{(0)} \leftarrow x$ [Initialization]
2. for $\ell = 1$ to $L$ do [Forward Propagation]
3. $\quad s^{(\ell)} \leftarrow \left(W^{(\ell)}\right)^{\mathrm{T}} x^{(\ell-1)}$
4. $\quad x^{(\ell)} \leftarrow \begin{bmatrix} 1 \\ \theta\!\left(s^{(\ell)}\right) \end{bmatrix}$
5. end for
6. $h(x) = x^{(L)}$ [Output]
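A minimal numpy sketch of this loop, assuming $\theta = \tanh$ and the dictionary-of-matrices representation from the shape check above (the function names are mine, and the prediction is read off the non-bias part of $x^{(L)}$):

```python
import numpy as np

def forward(W, x, L):
    """Forward propagation for one input x (given without its bias coordinate).

    W is a dict of weight matrices W[1..L]; returns the outputs x^(0..L) and signals s^(1..L)."""
    xs = {0: np.concatenate(([1.0], x))}                 # x^(0): input with the bias 1 prepended
    ss = {}
    for l in range(1, L + 1):
        ss[l] = W[l].T @ xs[l - 1]                       # s^(l) = (W^(l))^T x^(l-1)
        xs[l] = np.concatenate(([1.0], np.tanh(ss[l])))  # x^(l) = [1, theta(s^(l))], theta = tanh
    return xs, ss

def h(W, x, L):
    """The network output h(x): the non-bias part of x^(L)."""
    xs, _ = forward(W, x, L)
    return xs[L][1:]
```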

Notes:

Minimizing $E_{\text{in}}$

$$E_{\text{in}}(h) = E_{\text{in}}(W) = \frac{1}{N} \sum_{n=1}^{N} \left( h(x_n) - y_n \right)^2, \qquad W = \{W^{(1)}, W^{(2)}, \ldots, W^{(L)}\}$$

Pasted image 20260211234601.png

Using $\theta = \tanh$ makes $E_{\text{in}}$ differentiable, so we can use gradient descent $\longrightarrow$ local minimum.
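Given the forward pass sketched above, $E_{\text{in}}$ is just an average over the data (assuming a scalar output and the hypothetical `h` from the earlier sketch):

```python
def E_in(W, X, y, L):
    """In-sample error E_in(W) = (1/N) sum_n (h(x_n) - y_n)^2, assuming a scalar output."""
    N = len(X)
    return sum((h(W, X[n], L)[0] - y[n]) ** 2 for n in range(N)) / N
```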

Notes:

Gradient Descent

$$W(t+1) = W(t) - \eta\, \nabla E_{\text{in}}(W(t))$$
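Per layer matrix, one batch update then looks like this (a sketch; the gradients `grad[l]` come from backpropagation, derived below):

```python
def gradient_descent_step(W, grad, eta, L):
    """One step W^(l)(t+1) = W^(l)(t) - eta * dE_in/dW^(l), applied to every layer matrix."""
    return {l: W[l] - eta * grad[l] for l in range(1, L + 1)}
```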

Notes:

Gradient of $E_{\text{in}}$

$$E_{\text{in}}(W) = \frac{1}{N} \sum_{n=1}^{N} e_n\!\left(h(x_n), y_n\right), \qquad \frac{\partial E_{\text{in}}(W)}{\partial W^{(\ell)}} = \frac{1}{N} \sum_{n=1}^{N} \frac{\partial e_n}{\partial W^{(\ell)}}$$

We need

$$\frac{\partial e(x)}{\partial W^{(\ell)}}$$

Notes:

Algorithmic Approach

$e(x)$ is a function of $s^{(\ell)}$, and $s^{(\ell)} = \left(W^{(\ell)}\right)^{\mathrm{T}} x^{(\ell-1)}$.

$$\frac{\partial e}{\partial W^{(\ell)}} = \frac{\partial s^{(\ell)}}{\partial W^{(\ell)}} \times \left(\frac{\partial e}{\partial s^{(\ell)}}\right)^{\mathrm{T}} \quad \text{(chain rule)} \quad = x^{(\ell-1)} \left(\delta^{(\ell)}\right)^{\mathrm{T}}$$

sensitivity:

$$\delta^{(\ell)} = \frac{\partial e}{\partial s^{(\ell)}}$$
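Componentwise, since $s^{(\ell)}_j = \sum_i W^{(\ell)}_{ij}\, x^{(\ell-1)}_i$, this outer-product form just says

$$\frac{\partial e}{\partial W^{(\ell)}_{ij}} = \frac{\partial e}{\partial s^{(\ell)}_{j}}\, \frac{\partial s^{(\ell)}_{j}}{\partial W^{(\ell)}_{ij}} = \delta^{(\ell)}_{j}\, x^{(\ell-1)}_{i}.$$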

Notes:


$$\frac{\partial e}{\partial W^{(\ell)}} = x^{(\ell-1)} \left(\delta^{(\ell)}\right)^{\mathrm{T}}$$

Pasted image 20260211234850.png|350

Notes:

Computing $\delta^{(\ell)}$ Using the Chain Rule

$$\delta^{(1)} \leftarrow \delta^{(2)} \leftarrow \cdots \leftarrow \delta^{(L-1)} \leftarrow \delta^{(L)}$$

Multiple applications of the chain rule:

$$\Delta s^{(\ell)} \xrightarrow{\ \theta\ } \Delta x^{(\ell)} \xrightarrow{\ W^{(\ell+1)}\ } \Delta s^{(\ell+1)} \longrightarrow \cdots \longrightarrow \Delta e(x)$$

Pasted image 20260211235009.png|700
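Identifying each factor in this chain (informally, a sketch of the argument):

$$\delta^{(\ell)} = \frac{\partial e}{\partial s^{(\ell)}}
= \frac{\partial x^{(\ell)}}{\partial s^{(\ell)}} \cdot \frac{\partial s^{(\ell+1)}}{\partial x^{(\ell)}} \cdot \frac{\partial e}{\partial s^{(\ell+1)}},
\qquad \frac{\partial x^{(\ell)}}{\partial s^{(\ell)}} \sim \theta'\!\left(s^{(\ell)}\right), \quad
\frac{\partial s^{(\ell+1)}}{\partial x^{(\ell)}} \sim W^{(\ell+1)}, \quad
\frac{\partial e}{\partial s^{(\ell+1)}} = \delta^{(\ell+1)},$$

which gives the recursion below.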

Notes:

$$\delta^{(\ell)} = \theta'\!\left(s^{(\ell)}\right) \otimes \left[ W^{(\ell+1)} \delta^{(\ell+1)} \right]_{1,\ldots,d^{(\ell)}}$$

($\otimes$ denotes componentwise multiplication; the subscript keeps components $1,\ldots,d^{(\ell)}$, i.e. the bias component is dropped.)

Pasted image 20260211235051.png|350

Notes:

The Backpropagation Algorithm

$$\delta^{(1)} \leftarrow \delta^{(2)} \leftarrow \cdots \leftarrow \delta^{(L-1)} \leftarrow \delta^{(L)}$$

Pasted image 20260211235303.png|500
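A minimal numpy sketch of this backward pass, matching the earlier forward-pass sketch (taking $e = (x^{(L)} - y)^2$ with a tanh output node is an assumption of this sketch):

```python
import numpy as np

def backward(W, xs, ss, y, L):
    """Backward pass for one example: sensitivities delta^(l) and per-example gradients de/dW^(l)."""
    delta = {}
    # Output layer: with e = (x^(L) - y)^2 and a tanh output node,
    # delta^(L) = de/ds^(L) = 2 (tanh(s^(L)) - y) * (1 - tanh(s^(L))^2).
    out = np.tanh(ss[L])
    delta[L] = 2.0 * (out - y) * (1.0 - out ** 2)
    # Hidden layers: delta^(l) = theta'(s^(l)) (componentwise) [W^(l+1) delta^(l+1)], bias row dropped.
    for l in range(L - 1, 0, -1):
        theta_prime = 1.0 - np.tanh(ss[l]) ** 2
        delta[l] = theta_prime * (W[l + 1] @ delta[l + 1])[1:]
    # de/dW^(l) = x^(l-1) (delta^(l))^T -- an outer product with the same shape as W^(l).
    grads = {l: np.outer(xs[l - 1], delta[l]) for l in range(1, L + 1)}
    return delta, grads
```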

Algorithm for Gradient Descent on $E_{\text{in}}$

Pasted image 20260211235351.png|500
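Putting the pieces together, a sketch of batch gradient descent on $E_{\text{in}}$ that reuses the `forward`, `backward`, `gradient_descent_step`, and `h` sketches above (initialization scale, learning rate, and iteration count are arbitrary choices):

```python
import numpy as np

def train(X, y, d, eta=0.1, iterations=1000, seed=0):
    """Batch gradient descent on E_in for an MLP with layer dimensions d = [d^(0), ..., d^(L)]."""
    rng = np.random.default_rng(seed)
    L = len(d) - 1
    W = {l: rng.normal(scale=0.1, size=(d[l - 1] + 1, d[l])) for l in range(1, L + 1)}
    N = len(X)
    for _ in range(iterations):
        grad = {l: np.zeros_like(W[l]) for l in range(1, L + 1)}
        for n in range(N):                         # gradient of E_in = average of per-example gradients
            xs, ss = forward(W, X[n], L)           # forward pass
            _, g = backward(W, xs, ss, y[n], L)    # backward pass
            for l in range(1, L + 1):
                grad[l] += g[l] / N
        W = gradient_descent_step(W, grad, eta, L)
    return W

# Tiny illustrative run on the XOR points from the start of the section
# (whether this particular run fits XOR well depends on the random initialization).
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1.0, 1.0, 1.0, -1.0])
W = train(X, y, d=[2, 3, 1], eta=0.5, iterations=2000)
print([round(h(W, x, 2)[0], 2) for x in X])
```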

Digits Data

Pasted image 20260211235448.png