10 - Attention, Transformers, and Large Language Models

Class: CSCE-421


Notes:

These slides are based on Chapter 12 of Deep Learning: Foundations and Concepts

Example

I swam across the river to get to the other bank.
I walked across the road to get cash from the bank.

[Figure: image-21.png]

Notes:

Neural language and word embedding

$\mathbf{v}_n = \mathbf{E}\mathbf{x}_n$, where $\mathbf{E}$ is the (learned) embedding matrix and $\mathbf{x}_n$ is a one-hot encoding of the $n$-th token.
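A minimal NumPy sketch of the embedding step (the vocabulary size, embedding dimension, and random matrix are illustrative, not from the slides): multiplying a one-hot vector by the embedding matrix simply selects one column of it.

```python
import numpy as np

# Hypothetical sizes: vocabulary of K = 5 tokens, embedding dimension D = 3.
K, D = 5, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(D, K))   # embedding matrix (learned in practice, random here)

# x_n: one-hot encoding of token id 2.
x_n = np.zeros(K)
x_n[2] = 1.0

# v_n = E x_n selects column 2 of E.
v_n = E @ x_n
```

In practice a lookup (indexing into `E`) replaces the matrix product, but the math is the same.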

Notes:

Transformer processing

The input to a transformer is a set of vectors $\{\mathbf{x}_n\}$, $n = 1, \dots, N$, each of dimensionality $D$.

$\tilde{\mathbf{X}} = \text{TransformerLayer}[\mathbf{X}]$

[Figure: image-27.png]

Notes:

Attention coefficients

$\mathbf{y}_n = \sum_{m=1}^{N} a_{nm}\,\mathbf{x}_m$

where

$a_{nm} \geq 0$ and $\sum_{m=1}^{N} a_{nm} = 1$.

Commonly used coefficients

$a_{nm} = \dfrac{\exp(\mathbf{x}_n^T \mathbf{x}_m)}{\sum_{i=1}^{N} \exp(\mathbf{x}_n^T \mathbf{x}_i)}, \qquad \mathbf{a}_n = \text{Softmax}\left[\mathbf{x}_n^T\mathbf{x}_1,\ \mathbf{x}_n^T\mathbf{x}_2,\ \dots,\ \mathbf{x}_n^T\mathbf{x}_N\right]$

We have a different set of coefficients $a_{nm}$ for each output vector $\mathbf{y}_n$.
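The coefficients above can be sketched in a few lines of NumPy (sizes and random inputs are illustrative): each row of the coefficient matrix is a softmax over dot products, and each output is the corresponding weighted sum of inputs.

```python
import numpy as np

def softmax_rows(z):
    """Softmax applied independently to each row."""
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
N, D = 4, 3
X = rng.normal(size=(N, D))    # input vectors x_1..x_N as rows

A = softmax_rows(X @ X.T)      # A[n, m] = a_nm, softmax over dot products
Y = A @ X                      # y_n = sum_m a_nm x_m
```

Each row of `A` is a valid set of attention coefficients: non-negative and summing to one.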

Notes:

Attention in general (cross attention) - IMPORTANT

[Figure: image-28.png]

Notes:

General attention in matrix form

$\mathbf{Y} = \text{Softmax}[\mathbf{Q}\mathbf{K}^T]\,\mathbf{V}$

where $\text{Softmax}[\mathbf{L}]$ takes the exponential of every element of $\mathbf{L}$ and then normalizes each row independently to sum to one.

[Figure: image-30.png]

Notes:

Self-attention without parameters

$\mathbf{Y} = \text{Softmax}[\mathbf{X}\mathbf{X}^T]\,\mathbf{X}$

where $\text{Softmax}[\mathbf{L}]$ takes the exponential of every element of $\mathbf{L}$ and then normalizes each row independently to sum to one.

Notes:

Self-attention with parameters

$\mathbf{Q} = \mathbf{X}\mathbf{W}^{(q)} \in \mathbb{R}^{N \times D_k}, \qquad \mathbf{K} = \mathbf{X}\mathbf{W}^{(k)} \in \mathbb{R}^{N \times D_k}, \qquad \mathbf{V} = \mathbf{X}\mathbf{W}^{(v)} \in \mathbb{R}^{N \times D_v}$
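A NumPy sketch of the three projections (all sizes hypothetical; the weight matrices are random stand-ins for learned parameters), showing how the shapes work out:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, Dk, Dv = 4, 6, 3, 5          # hypothetical sequence length and dimensions
X = rng.normal(size=(N, D))

# Learnable projection matrices W^(q), W^(k), W^(v) (random here for illustration).
Wq = rng.normal(size=(D, Dk))
Wk = rng.normal(size=(D, Dk))
Wv = rng.normal(size=(D, Dv))

# Queries and keys share dimensionality Dk so Q K^T is well defined;
# values may have a different dimensionality Dv.
Q, K, V = X @ Wq, X @ Wk, X @ Wv
```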

[Figure: image-31.png]

Notes:

Comparison of Cross and Self Attention

[Figure: image-32.png]

Notes:

Scaled dot-product attention

$\mathbf{Y} = \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) \equiv \text{Softmax}\left[\dfrac{\mathbf{Q}\mathbf{K}^T}{\sqrt{D_k}}\right]\mathbf{V}$
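The scaled version can be written as a single small function (a sketch with illustrative shapes); dividing by $\sqrt{D_k}$ keeps the dot products from growing with the key dimension, which would otherwise push the softmax into a near-one-hot regime with tiny gradients.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: Softmax[Q K^T / sqrt(Dk)] V."""
    Dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(Dk)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)                 # row-wise softmax
    return A @ V

rng = np.random.default_rng(3)
Q = rng.normal(size=(4, 3))
K = rng.normal(size=(4, 3))
V = rng.normal(size=(4, 5))
Y = attention(Q, K, V)     # one output row per query, each of dimension Dv
```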

[Figure: image-33.png]

Notes:

Multi-head attention

$\mathbf{H}_h = \text{Attention}(\mathbf{Q}_h, \mathbf{K}_h, \mathbf{V}_h)$

$\mathbf{Q}_h = \mathbf{X}\mathbf{W}_h^{(q)}, \qquad \mathbf{K}_h = \mathbf{X}\mathbf{W}_h^{(k)}, \qquad \mathbf{V}_h = \mathbf{X}\mathbf{W}_h^{(v)}$

$\mathbf{Y}(\mathbf{X}) = \text{Concat}[\mathbf{H}_1, \dots, \mathbf{H}_H]\,\mathbf{W}^{(o)}$
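A compact NumPy sketch of the multi-head computation (head count, dimensions, and random weights are illustrative): each head runs scaled dot-product self-attention with its own projections, and the concatenated heads are mixed back to dimension $D$ by the output matrix $\mathbf{W}^{(o)}$.

```python
import numpy as np

def softmax_rows(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo):
    """Wq, Wk, Wv have shape (H, D, Dk); Wo has shape (H*Dk, D)."""
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_h, X @ Wk_h, X @ Wv_h
        A = softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1]))  # per-head attention
        heads.append(A @ V)                               # H_h
    return np.concatenate(heads, axis=-1) @ Wo            # Concat[...] W^(o)

rng = np.random.default_rng(4)
N, D, H, Dk = 4, 8, 2, 4
X = rng.normal(size=(N, D))
Wq = rng.normal(size=(H, D, Dk))
Wk = rng.normal(size=(H, D, Dk))
Wv = rng.normal(size=(H, D, Dk))
Wo = rng.normal(size=(H * Dk, D))
Y = multi_head_self_attention(X, Wq, Wk, Wv, Wo)
```

Production implementations compute all heads in one batched matrix multiply; the loop here trades speed for readability.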

[Figure: image-34.png]

Notes:

Question: Why are we doing this multi-head attention?

The most common version of attention is scaled dot-product multi-head self-attention.

Transformer layers

$\mathbf{Z} = \text{LayerNorm}[\mathbf{Y}(\mathbf{X}) + \mathbf{X}]$ (post-norm) $\qquad$ or $\qquad$ $\mathbf{Z} = \mathbf{Y}(\text{LayerNorm}[\mathbf{X}]) + \mathbf{X}$ (pre-norm)

[Figure: image-35.png]

Notes:

MLP in Transformer layers


$\tilde{\mathbf{X}} = \text{LayerNorm}[\text{MLP}[\mathbf{Z}] + \mathbf{Z}]$ (post-norm) $\qquad$ or $\qquad$ $\tilde{\mathbf{X}} = \text{MLP}[\mathbf{Z}'] + \mathbf{Z}$, where $\mathbf{Z}' = \text{LayerNorm}[\mathbf{Z}]$ (pre-norm)
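The two sub-layers fit together as shown in this minimal sketch of a pre-norm transformer layer (identity functions stand in for the attention and MLP sub-layers, which would carry the learned parameters; shapes are illustrative):

```python
import numpy as np

def layer_norm(Z, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale/shift)."""
    mu = Z.mean(axis=-1, keepdims=True)
    var = Z.var(axis=-1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)

def pre_norm_layer(X, attn, mlp):
    """Pre-norm transformer layer: normalize, transform, then add the residual."""
    Z = attn(layer_norm(X)) + X        # Z = Y(LayerNorm[X]) + X
    return mlp(layer_norm(Z)) + Z      # X~ = MLP[Z'] + Z, with Z' = LayerNorm[Z]

rng = np.random.default_rng(5)
N, D = 4, 6
X = rng.normal(size=(N, D))
identity = lambda A: A                 # stand-in for the attention and MLP sub-layers
out = pre_norm_layer(X, identity, identity)
```

The residual additions mean the layer maps $\mathbb{R}^{N \times D}$ to $\mathbb{R}^{N \times D}$, so layers can be stacked freely.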

Notes:

Positional encoding

Language models: Narrow sense

$p(x_1, \dots, x_N) = \prod_{n=1}^{N} p(x_n \mid x_1, \dots, x_{n-1})$
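A tiny worked example of this chain-rule factorization (the conditional probabilities below are made-up numbers for illustration, not from any trained model): the joint probability of a sequence is the product of each token's probability given the tokens before it.

```python
# Hypothetical conditional probabilities for the sequence ("I", "swam", "across"),
# keyed by the prefix up to and including the current token.
p = {
    ("I",): 0.20,                   # p(x1 = "I")
    ("I", "swam"): 0.05,            # p(x2 = "swam" | "I")
    ("I", "swam", "across"): 0.40,  # p(x3 = "across" | "I", "swam")
}

# Chain rule: multiply the conditionals in order.
seq = ("I", "swam", "across")
joint = 1.0
for n in range(1, len(seq) + 1):
    joint *= p[seq[:n]]
```

An autoregressive language model is exactly a parameterized version of this table: it outputs $p(x_n \mid x_1, \dots, x_{n-1})$ for every possible next token.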

Notes:

n-gram model and LLMs (Courtesy R. Kambhampati)

$p(x_1, \dots, x_N) = \prod_{n=1}^{N} p(x_n \mid x_1, \dots, x_{n-1})$

Language models: Broad sense