Cache Memories
Class: CSCE-312
Notes:
Today
- Cache memory organization and operation
- Performance impact of caches
- The memory mountain
- Rearranging loops to improve spatial locality
- Using blocking to improve temporal locality
Cache memory organization and operation
Example Memory Hierarchy
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251111135154.png)
Cache Memories
- Cache memories are small, fast SRAM-based memories managed automatically in hardware
- Hold frequently accessed blocks of main memory
- CPU looks first for data in cache
- Typical system structure:
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251114043018.png)
General Cache Organization (S, E, B)
...
![[Cache Memories.pdf]]
Cache Size: C = S * E * B data bytes
Structure of the Cache
Sets (S = 2ˢ)
- The cache is divided into S sets.
- A set is like a “bucket” that can hold multiple cache lines.
- The set index bits of an address determine which set a particular memory block maps to.
Lines per Set (E = 2ᵉ)
- Each set contains E lines (also known as “ways”).
- These are the slots that can hold cached blocks.
- Determines associativity:
- E = 1 → direct-mapped (1 line per set)
- E = 2, 4, 8 → set associative
- E = number of lines in entire cache → fully associative
- Pattern:
- More lines per set → fewer conflict misses → more hardware complexity.
Block Size (B = 2ᵇ bytes per block)
- Each cache line stores B bytes of contiguous memory.
- B is also called the cache block size or cache line size.
- Common sizes: 32B, 64B, 128B.
- A bigger block = better spatial locality, but more wasted bytes on misses.
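For a quick worked example (hypothetical parameters, not from the slides): with S = 64 sets, E = 8 lines per set, and B = 64-byte blocks,
C = S * E * B = 64 * 8 * 64 = 32,768 bytes = 32 KB
(the same size as the i7 L1 caches listed later).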
A Cache Line Contains:
A single cache line has the following fields:
1. Valid bit (v)
- Indicates whether the line contains meaningful data.
- v = 0 → line is empty/invalid
- v = 1 → line contains a real memory block
2. Tag
- Used to identify which memory block is stored in this line.
- The upper bits of the address (tag bits) help match the requested block.
3. Data block (B bytes)
- The actual bytes fetched from memory.
- Indexed using the block offset bits of the address.
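As a rough illustration, a software cache simulator might represent one line like this (a minimal sketch; the name `cache_line_t` and the fixed B = 64 are assumptions, not from the slides):

```c
#include <stdint.h>

#define B 64                /* block size in bytes (assumed) */

typedef struct {
    int      valid;         /* v: 1 if the line holds a real memory block */
    uint64_t tag;           /* upper address bits identifying that block */
    uint8_t  data[B];       /* the B bytes copied from memory */
} cache_line_t;
```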
Cache Read
- Locate set
- Check if any line in set has matching tag
- Yes + line valid: hit
- Locate data starting at offset
Notes:
- The address the machine generates is divided into fields; the set index bits select the set
Address of word:
- t bits: tag
- s bits: set index
- b bits: block offset
- Data begins at this offset
Address Breakdown
Every memory address is divided into three parts:
- Tag (t bits)
- Identifies which memory block is stored in a cache line.
- Compared against all lines in the selected set.
- Set index (s bits)
- Selects one of the S = 2ˢ sets in the cache.
- Determines exactly which bucket of lines to look inside.
- Block offset (b bits)
- Locates the exact byte within the B = 2ᵇ-byte block stored in the cache line.
- Low-order bits of the address.
Address of word:
| tag | set index | block offset |
| ------ | --------- | ------------ |
| t bits | s bits | b bits |
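A minimal C sketch of this split, assuming s = 6 set-index bits and b = 6 offset bits (hypothetical field widths; the shifts and masks work the same way for any s and b):

```c
#include <stdint.h>
#include <stdio.h>

#define S_BITS 6   /* s: set index bits (assumed) */
#define B_BITS 6   /* b: block offset bits (assumed) */

int main(void) {
    uint64_t addr = 0x7ffe12345678;

    uint64_t offset = addr & ((1ULL << B_BITS) - 1);             /* low b bits */
    uint64_t set    = (addr >> B_BITS) & ((1ULL << S_BITS) - 1); /* next s bits */
    uint64_t tag    = addr >> (B_BITS + S_BITS);                 /* remaining high bits */

    printf("tag=%#llx set=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)set,
           (unsigned long long)offset);
    return 0;
}
```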
Cache Lookup Process (Read Operation)
- Step 1 — Locate the Set
- Use the set index bits to pick one set:
- `set = address[set index bits]`
- Only this set could possibly contain the desired block.
- Step 2 — Search the Lines Within the Set
- Each set contains E lines (E = 2ᵉ ways).
- For every line in the set, check:
- Valid bit = 1?
- Tag matches the tag bits of the address?
- If both are true → we found the block.
- Step 3 — Determine Hit or Miss
- Hit
- At least one line in the set has matching tag and valid bit = 1.
- The cache returns the data immediately.
- Miss
- No matching tag found in any line.
- Cache must fetch the block from the next memory hierarchy level (L2, L3, DRAM).
- Step 4 — Locate the Exact Byte Using Block Offset
- If the tag matched (hit):
- Use the block offset bits to index into the B-byte data block.
- Example:
- If block size B = 32 bytes, offset selects byte 0–31.
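Steps 1-4 above, as a hedged C sketch of probing one set (the types, names, and the E = 8 / B = 64 parameters are illustrative, not a real hardware implementation):

```c
#include <stdint.h>
#include <stddef.h>

#define E 8    /* lines (ways) per set (assumed) */
#define B 64   /* block size in bytes (assumed) */

typedef struct {
    int      valid;
    uint64_t tag;
    uint8_t  data[B];
} line_t;

/* Probe one already-selected set: return a pointer to the requested byte
 * on a hit, or NULL on a miss (the caller would then fetch the block
 * from the next level of the hierarchy). */
static uint8_t *cache_read(line_t set[E], uint64_t tag, unsigned offset) {
    for (int i = 0; i < E; i++) {                 /* Step 2: search every line */
        if (set[i].valid && set[i].tag == tag)    /* Step 3: valid bit + tag match */
            return &set[i].data[offset];          /* Step 4: index by block offset */
    }
    return NULL;                                  /* miss */
}
```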
Check slides for "Example: Direct Mapped Cache (E = 1)"
- If tag doesn't match: old line is evicted and replaced
Check slides for "Direct-Mapped Cache Simulation"
Check slides for "E-way Set Associative Cache (Here: E = 2)"
- A cache that is divided into sets of multiple lines -> a set-associative cache
- Allows us to have a lower miss rate
- If 2 addresses map to the same set of the cache, associativity lets both blocks stay in that set instead of evicting each other.
Check slides for "2-Way Set Associative Cache Simulation"
What about writes?
- Multiple copies of data exist:
- L1, L2, L3, Main Memory, Disk
- What to do on a write-hit?
- Write-through (write immediately to memory)
- Write-back (defer write to memory until replacement of line)
- Need a dirty bit (line different from memory or not)
- What to do on a write-miss?
- Write-allocate (load into cache, update line in cache)
- Good if more writes to the location follow
- No-write-allocate (writes straight to memory, does not load into cache)
- Typical
- Write-through + No-write-allocate
- Write-back + Write-allocate
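A hedged sketch of the write-back + write-allocate combination for a single cache line (all names are made up; `fetch_block`/`writeback_block` stand in for the next memory level):

```c
#include <stdint.h>

#define B 64   /* block size in bytes (assumed) */

typedef struct {
    int      valid, dirty;   /* dirty: cached copy differs from memory */
    uint64_t tag;
    uint8_t  data[B];
} line_t;

/* Hypothetical helpers for talking to the next level of the hierarchy. */
void fetch_block(uint64_t tag, uint8_t *dst);
void writeback_block(uint64_t tag, const uint8_t *src);

void cache_write_byte(line_t *line, uint64_t tag, unsigned offset, uint8_t byte) {
    if (!(line->valid && line->tag == tag)) {        /* write miss */
        if (line->valid && line->dirty)              /* victim was modified */
            writeback_block(line->tag, line->data);  /* write back on eviction */
        fetch_block(tag, line->data);                /* write-allocate: load block */
        line->valid = 1;
        line->tag   = tag;
        line->dirty = 0;
    }
    line->data[offset] = byte;                       /* write hit: update cache copy only */
    line->dirty = 1;                                 /* defer the write to memory */
}
```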
Intel Core i7 Cache Hierarchy
- L1 i-cache and d-cache:
- 32 KB, 8-way
- Access: 4 cycles
- L2 unified cache:
- 256 KB, 8-way
- Access: 10 cycles
- L3 unified cache:
- 8 MB, 16-way
- Access: 40-75 cycles
- Block size: 64 bytes for all caches
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251114044631.png)
- i-cache: instructions
- d-cache: data
- Block size, E, and the number of sets are all you need to determine the capacity of the cache (see the worked example after this list).
- Word size does not actually matter because the fundamental unit of a cache is blocks.
- Dealing with words is something we do below the level of the cache
- Word is the fundamental atomic unit that the processor deals with internally -> think of bytes as fractions of words
- Block size is the fundamental unit for memory and caches
- Memory comes in pages, those are the main unit when we talk about virtual memory
- Sectors is a unit for disks
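Worked example using the i7 L1 parameters above (straightforward arithmetic, not stated on the slides):
- S = C / (E * B) = 32 KB / (8 * 64 B) = 32768 / 512 = 64 sets -> s = 6 set-index bits
- B = 64 bytes -> b = 6 block-offset bits
- The remaining high-order address bits form the tag.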
Cache Performance Metrics
- Miss Rate
- Fraction of memory references not found in cache (misses / accesses) = 1 - hit rate
- Typical numbers (in percentages):
- 3-10% for L1
- can be quite small (e.g., < 1%) for L2, depending on size, etc.
- Hit Time
- Time to deliver a line in the cache to the processor
- Includes time to determine whether the line is in the cache
- Typical numbers:
- 4 clock cycles for L1
- 10 clock cycles for L2
- Miss Penalty
- Additional time required because of a miss
- typically 50-200 cycles for main memory (Trend: increasing!)
Notes:
- Miss rate can be small for large caches
- A miss can be much more costly than a hit
- Every hit still costs the hit time (the roundtrip to that cache level)
- On a miss we have to go to the next level of the hierarchy, where the hit time is greater; it takes longer, and gets worse and worse as you go down the hierarchy.
- The trend in recent years is for the miss penalty to be increasing.
Let's think about those numbers
- Huge difference between a hit and a miss
- Could be 100x, if just L1 and main memory
- Would you believe 99% hits is twice as good as 97%?
- Consider:
- cache hit time of 1 cycle
- miss penalty of 100 cycles
- Average Access Time:
- AMAT = Hit time + Miss rate × Miss penalty
- Example:
- 97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
- The expected value of the average access time is given by this equation
- We hit the cache in 1 cycle
- The miss takes 100 cycles
- The average access time is 1 cycle + a 3% chance of the 100-cycle miss penalty
- 99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
- This is why "miss rate" is used instead of "hit rate"
Notes:
- Examples in code:
- Generally, minimizing misses matters most in the inner loops
- Try to access memory sequentially -> remember spatial locality
- If we access the very next word, it is already in the cache, and so are the bytes after it, and so on (things we know are in the cache because of spatial locality)
Writing Cache Friendly Code
- Make the common case go fast
- Focus on the inner loops of the core functions
- Minimize the misses in the inner loops
- Repeated references to variables are good (temporal locality)
- Stride-1 reference patterns are good (spatial locality)
- Key idea: Our qualitative notion of locality is quantified through our understanding of cache memories
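A small made-up C example of both points: the stride-1 walk gives spatial locality, and accumulating into a local variable (which the compiler can keep in a register) gives temporal locality:

```c
/* Cache-friendly summation: stride-1 accesses + a register-resident accumulator */
long sum_array(const long *a, long n) {
    long sum = 0;                  /* repeated references to sum stay in a register */
    for (long i = 0; i < n; i++)   /* stride-1 pattern: good spatial locality */
        sum += a[i];
    return sum;
}
```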
Performance impact of caches
The Memory Mountain
- Read throughput (read bandwidth)
- Number of bytes read from memory per second (MB/s)
- Memory mountain: Measured throughput as a function of spatial and temporal locality
- Compact way to characterize memory system performance.
Memory Mountain Test Function
mountain/mountain.c
long data[MAXELEMS]; /* Global array to traverse */
/* test - Iterate over first "elems" elements of
 * array "data" with stride of "stride",
 * using 4x4 loop unrolling.
 */
int test(int elems, int stride) {
long i, sx2=stride*2, sx3=stride*3, sx4=stride*4;
long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
long length = elems, limit = length - sx4;
/* Combine 4 elements at a time */
for (i = 0; i < limit; i += sx4) {
acc0 = acc0 + data[i];
acc1 = acc1 + data[i+stride];
acc2 = acc2 + data[i+sx2];
acc3 = acc3 + data[i+sx3];
}
/* Finish any remaining elements */
for (; i < length; i++) {
acc0 = acc0 + data[i];
}
return ((acc0 + acc1) + (acc2 + acc3));
}
- Call `test()` with many combinations of `elems` and `stride`
- For each `elems` and `stride`:
- Call `test()` once to warm up the caches
- Call `test()` again and measure the read throughput (MB/s)
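A hedged sketch of what such a measurement driver could look like; the real mountain.c uses the book's own cycle timers, so the POSIX `clock_gettime` call and the MB/s arithmetic here are only stand-ins:

```c
#include <time.h>

extern long data[];                 /* from mountain.c above */
int test(int elems, int stride);

/* Return approximate read throughput in MB/s for one (elems, stride) pair. */
double run(int elems, int stride) {
    struct timespec t0, t1;

    test(elems, stride);                        /* warm up the caches */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    test(elems, stride);                        /* measured run */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = (double)elems / stride * sizeof(long);  /* approx. elements read */
    return bytes / (1024.0 * 1024.0) / secs;    /* MB/s */
}
```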
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251114050215.png)
Rearranging loops to improve spatial locality
Matrix Multiplication Example
matmult/mm.c
/* ijk */
for (i=0; i<n; i++) {
for (j=0; j<n; j++) {
sum = 0.0; // variable `sum` held in register
for (k=0; k<n; k++)
sum += a[i][k] * b[k][j];
c[i][j] = sum;
}
}
- Canonical example of well vs. poorly written code
- Matrix multiplication is very common in the world
- Description:
- Multiply NxN matrices
- Matrix elements are doubles (8 bytes)
- O(N^3) total operations
- N reads per source element
- N values summed per destination
- but may be able to hold in register
Notes:
- You have to take the dot product of each row of one matrix with each column of the other matrix.
- Dot product: take the pairwise product of elements
- Any faster algorithms for Matrix multiplication?
- Strassen's algorithm: rearranges things to save a few multiplications
- It is asymptotically better than O(N^3)
- We are accessing `a` with good spatial locality -> going one by one through each row of `a` (accessing `a` sequentially)
- Poor spatial locality is each time we change `k`, since we are going to a different row of the matrix (the accesses to `b` walk down a column).
Miss Rate Analysis for Matrix Multiply
- Assume:
- Block size = 32 B (big enough for four doubles)
- Matrix dimension (N) is very large
- Approximate 1/N as 0.0
- Cache is not even big enough to hold multiple rows
- Analysis Method:
- Look at access pattern on inner loop
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251114050846.png)
Layout of C Arrays in Memory (review)
- C arrays allocated in row-major order
- each row in contiguous memory locations
- Stepping through columns in one row:
- `for (i = 0; i < N; i++) sum += a[0][i];`
- accesses successive elements = good spatial locality
- if block size (B) > sizeof(aij) bytes, exploit spatial locality
- miss rate = sizeof(aij) / B
- Example:
- If a block holds 4 doubles (32 bytes), then:
- One miss brings in 4 future elements.
- Miss rate ≈ sizeof(double)/block_size = 8/32 = 1/4.
- If a block holds 4 doubles (32 bytes), then:
- Stepping through rows in one column:
- `for (i = 0; i < n; i++) sum += a[i][0];`
- accesses distant elements
- no spatial locality!
- miss rate = 1 (i.e. 100%)
Check slides for "Matrix Multiplication (...)"
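For comparison (one of the orderings the slides cover), here is a sketch of the kij rearrangement; `r` is an extra scalar temporary, everything else is as in the ijk code above. The inner loop now walks rows of both `b` and `c` with stride 1:

```c
/* kij ordering: inner loop scans b[k][*] and c[i][*] row-wise (stride 1) */
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];               /* one element of a, held in a register */
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}
```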
Core i7 Matrix Multiply Performance
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251114051608.png)
- We tested several different formulations of the matrix multiplication algorithm; by rearranging the loop order we can reduce the cycles per inner-loop iteration by a good amount.
Using blocking to improve temporal locality
Example: Matrix Multiplication
c = (double *) calloc(sizeof(double), n*n);
/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n) {
int i, j, k;
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
for (k = 0; k < n; k++)
c[i*n + j] += a[i*n + k] * b[k*n + j];
}
- Blocking lets you focus on little areas (tiles) that all have good locality (see the sketch after these notes)
- What is the best blocking factor on your hardware? This is what we want to answer.
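A sketch of the blocked (tiled) version, under the assumptions that the tile size `BSIZE` divides n evenly and that c starts zeroed (as with the calloc above); the names are illustrative:

```c
#define BSIZE 32   /* tile size: tune so three BSIZE x BSIZE tiles fit in cache */

/* Blocked n x n multiply: reuse each tile of a, b, and c many times
 * while it is still resident in the cache (temporal locality). */
void mmm_blocked(double *a, double *b, double *c, int n) {
    for (int i1 = 0; i1 < n; i1 += BSIZE)
        for (int j1 = 0; j1 < n; j1 += BSIZE)
            for (int k1 = 0; k1 < n; k1 += BSIZE)
                /* multiply one pair of tiles into the c tile */
                for (int i = i1; i < i1 + BSIZE; i++)
                    for (int j = j1; j < j1 + BSIZE; j++) {
                        double sum = c[i*n + j];
                        for (int k = k1; k < k1 + BSIZE; k++)
                            sum += a[i*n + k] * b[k*n + j];
                        c[i*n + j] = sum;
                    }
}
```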
Cache Miss Analysis
- Assume:
- Matrix elements are doubles
- Cache block = 8 doubles
- Cache size C << n (much smaller than n)
- First iteration:
- n/8 + n = 9n/8 misses
- Afterwards in cache: (schematic)
- Second iteration:
- Again: n/8 + n = 9n/8 misses
- Total misses:
- 9n/8 * n² = (9/8) * n³
Remember: Example about prime numbers.
Blocking Summary
...
Cache Summary
- Cache memories can have significant performance impact
- You can write your programs to exploit this!
- Focus on the inner loops, where bulk of computations and memory accesses occur.
- Try to maximize spatial locality by reading data objects sequentially with stride 1.
- Try to maximize temporal locality by using a data object as often as possible once it's read from memory.