10 - Attention, Transformers, and Large Language Models

Class: CSCE-421


Notes:

These slides are based on Chapter 12 of Deep Learning: Foundations and Concepts

Example

I swam across the river to get to the other bank.
I walked across the road to get cash from the bank.

07 - Inbox/Visual Aids/image-21.png452x170

Notes:

Neural language models and word embeddings

$v_n = E\, x_n$

Notes:

Transformer processing

The input data to a transformer is a set of vectors $\{x_n\}$ of dimensionality $D$, where $n = 1, \ldots, N$

$\tilde{X} = \text{TransformerLayer}[X]$

image-27.png174

Notes:

Attention coefficients

$y_n = \sum_{m=1}^{N} a_{nm}\, x_m$

where

$a_{nm} \geq 0$, and $\sum_{m=1}^{N} a_{nm} = 1$.

Commonly used coefficients

$a_{nm} = \dfrac{\exp(x_n^\top x_m)}{\sum_{i=1}^{N} \exp(x_n^\top x_i)}, \qquad a_n = \text{Softmax}\left[x_n^\top x_1,\; x_n^\top x_2,\; \ldots,\; x_n^\top x_N\right]$

We have a different set of coefficients $\{a_{nm}\}$ for each output vector $y_n$
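As a minimal sketch, the coefficients above can be computed by taking all pairwise dot products and applying a softmax to each row; the example input below is made up for illustration.

```python
import numpy as np

def attention_coefficients(X):
    """Compute a_nm = softmax over m of (x_n . x_m) for N row vectors X of shape (N, D).

    Each row of the returned matrix is non-negative and sums to one,
    matching the constraints on the attention coefficients."""
    scores = X @ X.T                              # (N, N) matrix of dot products x_n^T x_m
    scores = scores - scores.max(axis=1, keepdims=True)  # subtract row max for stability
    expd = np.exp(scores)
    return expd / expd.sum(axis=1, keepdims=True)

def attend(X):
    """y_n = sum_m a_nm x_m, i.e. Y = A X."""
    return attention_coefficients(X) @ X

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy inputs
A = attention_coefficients(X)
Y = attend(X)
```

Each output $y_n$ is a convex combination of the inputs, with its own row of weights.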

Notes:

Attention in general (cross attention) - IMPORTANT

image-28.png351x254

Notes:

General attention in matrix form

$Y = \text{Softmax}[QK^\top]\, V$

where $\text{Softmax}[L]$ takes the exponential of every element of $L$ and then normalizes each row independently to sum to one

image-30.png428x177
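The matrix form can be sketched directly, assuming a row-wise softmax; in cross-attention the queries may come from a different sequence than the keys and values, so only the dimensions $D_k$ must agree:

```python
import numpy as np

def row_softmax(L):
    """Exponentiate every element of L, then normalize each row to sum to one."""
    L = L - L.max(axis=1, keepdims=True)  # stability shift, leaves result unchanged
    E = np.exp(L)
    return E / E.sum(axis=1, keepdims=True)

def cross_attention(Q, K, V):
    """Y = Softmax[Q K^T] V (the unscaled form from the slide).

    Q has one row per query; K and V have one row per key/value pair."""
    return row_softmax(Q @ K.T) @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 3))   # 4 queries of dimension 3
K = rng.normal(size=(6, 3))   # 6 keys of the same dimension
V = rng.normal(size=(6, 5))   # 6 values of dimension 5
Y = cross_attention(Q, K, V)  # (4, 5): one output row per query
```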

Notes:

Self-attention without parameters

$Y = \text{Softmax}[XX^\top]\, X$

where $\text{Softmax}[L]$ takes the exponential of every element of $L$ and then normalizes each row independently to sum to one

Notes:

Self-attention with parameters

$Q = X W^{(q)} \in \mathbb{R}^{N \times D_k}, \quad K = X W^{(k)} \in \mathbb{R}^{N \times D_k}, \quad V = X W^{(v)} \in \mathbb{R}^{N \times D_v}$

image-31.png507
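A minimal sketch of the parameterized projections, with random matrices standing in for the learned weights $W^{(q)}, W^{(k)}, W^{(v)}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, Dk, Dv = 5, 8, 4, 6          # illustrative sizes

X = rng.normal(size=(N, D))
W_q = rng.normal(size=(D, Dk))     # learned in practice; random here
W_k = rng.normal(size=(D, Dk))
W_v = rng.normal(size=(D, Dv))

Q = X @ W_q                        # (N, Dk)
K = X @ W_k                        # (N, Dk)
V = X @ W_v                        # (N, Dv)

def row_softmax(L):
    L = L - L.max(axis=1, keepdims=True)
    E = np.exp(L)
    return E / E.sum(axis=1, keepdims=True)

Y = row_softmax(Q @ K.T) @ V       # (N, Dv): self-attention with parameters
```

Note that the output dimensionality is now $D_v$ rather than $D$, since the values are projected.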

Notes:

Comparison of Cross and Self Attention

image-32.png507

Notes:

Scaled dot-product attention

$Y = \text{Attention}(Q,\, K,\, V) \equiv \text{Softmax}\!\left[\dfrac{QK^\top}{\sqrt{D_k}}\right] V$

image-33.png164
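The scaling is a one-line change to the earlier sketch; dividing the scores by $\sqrt{D_k}$ keeps their variance of order one so the softmax does not saturate as $D_k$ grows. The example inputs are again made up:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Y = Softmax[Q K^T / sqrt(D_k)] V."""
    Dk = Q.shape[-1]
    L = Q @ K.T / np.sqrt(Dk)        # scaled scores
    L = L - L.max(axis=1, keepdims=True)
    A = np.exp(L)
    A = A / A.sum(axis=1, keepdims=True)
    return A @ V

rng = np.random.default_rng(1)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 6))
Y = scaled_dot_product_attention(Q, K, V)
```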

Notes:

Multi-head attention

$H_h = \text{Attention}(Q_h,\, K_h,\, V_h), \qquad Q_h = X W_h^{(q)}, \quad K_h = X W_h^{(k)}, \quad V_h = X W_h^{(v)}$

$Y(X) = \text{Concat}\left[H_1, \ldots, H_H\right] W^{(o)}$

image-34.png373x410
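A sketch of the multi-head computation, assuming each head is scaled dot-product self-attention and the per-head weights are stacked into arrays (random here, learned in practice):

```python
import numpy as np

def row_softmax(L):
    L = L - L.max(axis=1, keepdims=True)
    E = np.exp(L)
    return E / E.sum(axis=1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo):
    """H heads of scaled dot-product self-attention, concatenated and
    mixed by the output matrix W^(o).

    Wq, Wk, Wv each have shape (H, D, D_head); Wo has shape (H*D_head, D)."""
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_h, X @ Wk_h, X @ Wv_h
        A = row_softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(A @ V)                        # H_h = Attention(Q_h, K_h, V_h)
    return np.concatenate(heads, axis=1) @ Wo      # Concat[H_1..H_H] W^(o)

rng = np.random.default_rng(0)
N, D, H, Dh = 5, 8, 2, 4
X = rng.normal(size=(N, D))
Y = multi_head_self_attention(
    X,
    rng.normal(size=(H, D, Dh)),
    rng.normal(size=(H, D, Dh)),
    rng.normal(size=(H, D, Dh)),
    rng.normal(size=(H * Dh, D)),
)
```

Each head can attend to a different pattern of positions; the output projection mixes them back to dimension $D$.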

Notes:

Question: why do we use multi-head attention?

The most common version of attention is scaled dot-product multi-head self-attention.

Transformer layers

$Z = \text{LayerNorm}\left[Y(X) + X\right]$ (post-norm) $\qquad$ or $\qquad Z = Y(\text{LayerNorm}[X]) + X$ (pre-norm)

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-35.png134x262

Notes:

MLP in Transformer layers

Transformer layers

$\tilde{X} = \text{LayerNorm}\left[\text{MLP}[Z] + Z\right]$ (post-norm) $\qquad$ or $\qquad \tilde{X} = \text{MLP}(Z') + Z$, where $Z' = \text{LayerNorm}[Z]$ (pre-norm)
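The full layer can be sketched end to end. This is a pre-norm variant with an unparameterized self-attention and a two-layer ReLU MLP standing in for the learned sublayers; the gain and bias of layer normalization are omitted for brevity:

```python
import numpy as np

def layer_norm(Z, eps=1e-5):
    """Normalize each row to zero mean and unit variance (gain/bias omitted)."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)

def self_attention(X):
    """Parameter-free self-attention: Y = Softmax[X X^T] X."""
    L = X @ X.T
    L = L - L.max(axis=1, keepdims=True)
    A = np.exp(L)
    A = A / A.sum(axis=1, keepdims=True)
    return A @ X

def mlp(Z, W1, W2):
    """Two-layer MLP with ReLU, applied independently to each token position."""
    return np.maximum(Z @ W1, 0.0) @ W2

def transformer_layer_prenorm(X, W1, W2):
    Z = self_attention(layer_norm(X)) + X    # Z = Y(LayerNorm[X]) + X
    return mlp(layer_norm(Z), W1, W2) + Z    # X~ = MLP(LayerNorm[Z]) + Z

rng = np.random.default_rng(0)
N, D, D_ff = 4, 8, 16
X = rng.normal(size=(N, D))
out = transformer_layer_prenorm(X, rng.normal(size=(D, D_ff)), rng.normal(size=(D_ff, D)))
```

The residual connections mean the layer maps $\mathbb{R}^{N \times D} \to \mathbb{R}^{N \times D}$, so layers can be stacked.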

Notes:

Positional encoding

Notes:

Language models: Narrow sense

$p(x_1, \ldots, x_N) = \prod_{n=1}^{N} p(x_n \mid x_1, \ldots, x_{n-1})$
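In log space the product becomes a sum, which is how sequence probabilities are computed in practice. A tiny sketch, with made-up conditional probabilities standing in for a model's outputs:

```python
import numpy as np

def sequence_log_prob(cond_probs):
    """log p(x_1..x_N) = sum_n log p(x_n | x_1..x_{n-1}).

    cond_probs holds the model's probability for each observed token
    given its prefix; the values below are invented for illustration."""
    return float(np.sum(np.log(cond_probs)))

# p(x1) = 0.5, p(x2 | x1) = 0.25, p(x3 | x1, x2) = 0.1
lp = sequence_log_prob([0.5, 0.25, 0.1])   # log(0.5 * 0.25 * 0.1)
```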

Notes:

n-gram model and LLMs (Courtesy R. Kambhampati)

$p(x_1, \ldots, x_N) = p(x_1)\, p(x_2 \mid x_1) \prod_{n=3}^{N} p(x_n \mid x_{n-1}, x_{n-2})$
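Unlike a transformer, the trigram conditionals are simply estimated from counts. A sketch of the maximum-likelihood estimate on a toy corpus (no smoothing):

```python
from collections import Counter

def trigram_probs(tokens):
    """Maximum-likelihood trigram estimates p(x_n | x_{n-2}, x_{n-1})
    from raw counts: count(a,b,c) / count(a,b)."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    return {(a, b, c): n / bi[(a, b)] for (a, b, c), n in tri.items()}

corpus = "a b c a b d a b c".split()   # toy corpus where "a b" continues two ways
p = trigram_probs(corpus)
# p[("a","b","c")] = 2/3 and p[("a","b","d")] = 1/3
```

The fixed context window of two tokens is exactly what the LLM's full-history conditioning removes.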

Notes:

Language models: Broad sense

Notes:

Decoder transformers I

Notes:

Decoder transformers II

$Y = \text{Softmax}(\tilde{X}\, W^{(p)})$

where $Y$ is a matrix whose $n$th row is $y_n^\top$, and $\tilde{X}$ is a matrix whose $n$th row is $\tilde{x}_n^\top$
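A sketch of this output head: each row of $\tilde{X}$ is mapped by the (shared) projection $W^{(p)}$ to logits over the vocabulary and then softmaxed, giving one next-token distribution per position. Sizes and weights are illustrative:

```python
import numpy as np

def token_distributions(X_tilde, W_p):
    """Y = Softmax(X~ W^(p)): map each output embedding to a probability
    distribution over the vocabulary, one row per token position."""
    L = X_tilde @ W_p                    # (N, vocab) logits
    L = L - L.max(axis=1, keepdims=True)
    E = np.exp(L)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, D, vocab = 4, 8, 10
Y = token_distributions(rng.normal(size=(N, D)), rng.normal(size=(D, vocab)))
```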

...

Decoder transformers: causal language modeling

I swam across the river to get to the other bank.

image-37.png227
Figure 12.16 An illustration of the mask matrix for masked self-attention. Attention weights corresponding to the red elements are set to zero. Thus, in predicting the token 'across', the output can depend only on the input tokens '<start>', 'I', and 'swam'.
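A sketch of the mask in code: scores above the diagonal are set to $-\infty$ before the softmax, so position $n$ can attend only to positions $1, \ldots, n$ and the corresponding attention weights come out exactly zero.

```python
import numpy as np

def masked_self_attention(X):
    """Self-attention with a 'look ahead' mask for causal language modeling."""
    N = X.shape[0]
    L = X @ X.T
    mask = np.triu(np.ones((N, N), dtype=bool), k=1)  # True strictly above the diagonal
    L = np.where(mask, -np.inf, L)                    # forbidden scores -> -inf
    L = L - L.max(axis=1, keepdims=True)
    A = np.exp(L)                                     # exp(-inf) = 0: masked weights vanish
    A = A / A.sum(axis=1, keepdims=True)
    return A, A @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
A, Y = masked_self_attention(X)   # A is lower-triangular; row 0 attends only to itself
```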

Notes:

Decoder transformer architecture

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex2/Visual Aids/image-36.png
Figure 12.15 Architecture of a GPT decoder transformer network. Here 'LSM' stands for linear-softmax and denotes a linear transformation whose learnable parameters are shared across the token positions, followed by a softmax activation function. Masking is explained in the text.

Notes:

Remember self-attention:

Difference between training and generation/inference

Sampling strategies during generation/inference I

$p(y_1, \ldots, y_N) = \prod_{n=1}^{N} p(y_n \mid y_1, \ldots, y_{n-1})$

Notes:

Sampling strategies during generation/inference II

$y_i = \dfrac{\exp(a_i / T)}{\sum_j \exp(a_j / T)}$
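A sketch of temperature sampling: dividing the logits by $T$ before the softmax sharpens the distribution for $T < 1$ (toward greedy decoding) and flattens it for $T > 1$ (toward uniform sampling). The logits below are made up:

```python
import numpy as np

def temperature_softmax(a, T):
    """y_i = exp(a_i / T) / sum_j exp(a_j / T)."""
    z = np.asarray(a, dtype=float) / T
    z = z - z.max()            # stability shift, leaves result unchanged
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.0]
sharp = temperature_softmax(logits, 0.5)  # low T: most mass on the top logit
flat = temperature_softmax(logits, 5.0)   # high T: closer to uniform
```

At generation time one samples the next token from this distribution (possibly combined with top-k or nucleus truncation).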

Notes:

Encoder transformers: Masked language modeling

Notes:

Encoder transformer architecture

image-38.png

Figure 12.18 Architecture of an encoder transformer model. The boxes labelled 'LSM' denote a linear transformation whose learnable parameters are shared across the token positions, followed by a softmax activation function. The main differences compared to the decoder model are that the input sequence is not shifted to the right, and the 'look ahead' masking matrix is omitted and therefore, within each self-attention layer, every output token can attend to any of the input tokens.

Notes:

Sequence-to-sequence transformers

Notes:

Comparison of self and cross attention

image-39.png187x395 image-40.png161x397

Figure 12.19 Schematic illustration of one cross-attention layer as used in the decoder section of a sequence-to-sequence transformer. Here Z denotes the output from the encoder section. Z determines the key and value vectors for the cross-attention layer, whereas the query vectors are determined within the decoder section.

Notes:

Sequence to sequence transformer architecture

image-41.png

Figure 12.20 Schematic illustration of a sequence-to-sequence transformer. To keep the diagram uncluttered the input tokens are collectively shown as a single box, and likewise for the output tokens. Positional-encoding vectors are added to the input tokens for both the encoder and decoder sections. Each layer in the encoder corresponds to the structure shown in Figure 12.9, and each cross-attention layer is of the form shown in Figure 12.19.

Notes:

Large language models: Pretraining

Notes:

...

Large language models: Emerging properties

Notes:

Large language models: Prompting

image-42.png426

Notes: