Cache Memories
Class: CSCE-312
Notes:
Today
- Cache memory organization and operation
- Performance impact of caches
- The memory mountain
- Rearranging loops to improve spatial locality
- Using blocking to improve temporal locality
Cache memory organization and operation
Example Memory Hierarchy
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251111135154.png)
Cache Memories
- Cache memories are small, fast SRAM-based memories managed automatically in hardware
- Hold frequently accessed blocks of main memory
- CPU looks first for data in cache
- Typical system structure:
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251114043018.png)
General Cache Organization (S, E, B)
...
![[Cache Memories.pdf]]
Cache Size: C = S * E * B data bytes
Structure of the Cache
Sets (S = 2ˢ)
- The cache is divided into S sets.
- A set is like a “bucket” that can hold multiple cache lines.
- The set index bits of an address determine which set a particular memory block maps to.
Lines per Set (E = 2ᵉ)
- Each set contains E lines (also known as “ways”).
- These are the slots that can hold cached blocks.
- Determines associativity:
- E = 1 → direct-mapped (1 line per set)
- E = 2, 4, 8 → set associative
- E = number of lines in entire cache → fully associative
- Pattern:
- More lines per set → fewer conflict misses → more hardware complexity.
Block Size (B = 2ᵇ bytes per block)
- Each cache line stores B bytes of contiguous memory.
- B is also called the cache block size or cache line size.
- Common sizes: 32B, 64B, 128B.
- A bigger block = better spatial locality, but more wasted bytes on misses.
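For a quick worked example (hypothetical parameters, not from the slides): with S = 64 sets, E = 8 lines per set, and B = 64-byte blocks,
C = S * E * B = 64 * 8 * 64 = 32,768 bytes = 32 KB
(the same size as the i7 L1 caches listed later).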
A Cache Line Contains:
A single cache line has the following fields:
1. Valid bit (v)
- Indicates whether the line contains meaningful data.
- v = 0 → line is empty/invalid
- v = 1 → line contains a real memory block
2. Tag
- Used to identify which memory block is stored in this line.
- The upper bits of the address (tag bits) help match the requested block.
3. Data block (B bytes)
- The actual bytes fetched from memory.
- Indexed using the block offset bits of the address.
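As a rough illustration, a software cache simulator might represent one line like this (a minimal sketch; the name `cache_line_t` and the fixed B = 64 are assumptions, not from the slides):

```c
#include <stdint.h>

#define B 64                /* block size in bytes (assumed) */

typedef struct {
    int      valid;         /* v: 1 if the line holds a real memory block */
    uint64_t tag;           /* upper address bits identifying that block */
    uint8_t  data[B];       /* the B bytes copied from memory */
} cache_line_t;
```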
Cache Read
- Locate set
- Check if any line in set has matching tag
- Yes + line valid: hit
- Locate data starting at offset
Notes:
- The address the machine generates is divided into fields; the set index bits select the set
Address of word:
- t bits: tag
- s bits: set index
- b bits: block offset
- Data begins at this offset
Address Breakdown
Every memory address is divided into three parts:
- Tag (t bits)
- Identifies which memory block is stored in a cache line.
- Compared against all lines in the selected set.
- Set index (s bits)
- Selects one of the S = 2ˢ sets in the cache.
- Determines exactly which bucket of lines to look inside.
- Block offset (b bits)
- Locates the exact byte within the B = 2ᵇ-byte block stored in the cache line.
- Low-order bits of the address.
Address of word:
| tag | set index | block offset |
| ------ | --------- | ------------ |
| t bits | s bits | b bits |
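A minimal C sketch of this split, assuming s = 6 set-index bits and b = 6 offset bits (hypothetical field widths; the shifts and masks work the same way for any s and b):

```c
#include <stdint.h>
#include <stdio.h>

#define S_BITS 6   /* s: set index bits (assumed) */
#define B_BITS 6   /* b: block offset bits (assumed) */

int main(void) {
    uint64_t addr = 0x7ffe12345678;

    uint64_t offset = addr & ((1ULL << B_BITS) - 1);             /* low b bits */
    uint64_t set    = (addr >> B_BITS) & ((1ULL << S_BITS) - 1); /* next s bits */
    uint64_t tag    = addr >> (B_BITS + S_BITS);                 /* remaining high bits */

    printf("tag=%#llx set=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)set,
           (unsigned long long)offset);
    return 0;
}
```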
Cache Lookup Process (Read Operation)
- Step 1 — Locate the Set
- Use the set index bits to pick one set:
- `set = address[set index bits]`
- Only this set could possibly contain the desired block.
- Step 2 — Search the Lines Within the Set
- Each set contains E lines (E = 2ᵉ ways).
- For every line in the set, check:
- Valid bit = 1?
- Tag matches the tag bits of the address?
- If both are true → we found the block.
- Step 3 — Determine Hit or Miss
- Hit
- At least one line in the set has matching tag and valid bit = 1.
- The cache returns the data immediately.
- Miss
- No matching tag found in any line.
- Cache must fetch the block from the next memory hierarchy level (L2, L3, DRAM).
- Step 4 — Locate the Exact Byte Using Block Offset
- If the tag matched (hit):
- Use the block offset bits to index into the B-byte data block.
- Example:
- If block size B = 32 bytes, offset selects byte 0–31.
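Steps 1-4 above, as a hedged C sketch of probing one set (the types, names, and the E = 8 / B = 64 parameters are illustrative, not a real hardware implementation):

```c
#include <stdint.h>
#include <stddef.h>

#define E 8    /* lines (ways) per set (assumed) */
#define B 64   /* block size in bytes (assumed) */

typedef struct {
    int      valid;
    uint64_t tag;
    uint8_t  data[B];
} line_t;

/* Probe one already-selected set: return a pointer to the requested byte
 * on a hit, or NULL on a miss (the caller would then fetch the block
 * from the next level of the hierarchy). */
static uint8_t *cache_read(line_t set[E], uint64_t tag, unsigned offset) {
    for (int i = 0; i < E; i++) {                 /* Step 2: search every line */
        if (set[i].valid && set[i].tag == tag)    /* Step 3: valid bit + tag match */
            return &set[i].data[offset];          /* Step 4: index by block offset */
    }
    return NULL;                                  /* miss */
}
```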
Check slides for "Example: Direct Mapped Cache (E = 1)"
- If tag doesn't match: old line is evicted and replaced
Check slides for "Direct-Mapped Cache Simulation"
Check slides for "E-way Set Associative Cache (Here: E = 2)"
- A cache that is divided into sets of multiple lines -> a set-associative cache
- Allows us to have a lower miss rate
- If 2 addresses map to the same set of the cache, associativity lets both blocks stay in that set instead of evicting each other.
Check slides for "2-Way Set Associative Cache Simulation"
What about writes?
- Multiple copies of data exist:
- L1, L2, L3, Main Memory, Disk
- What to do on a write-hit?
- Write-through (write immediately to memory)
- Write-back (defer write to memory until replacement of line)
- Need a dirty bit (line different from memory or not)
- What to do on a write-miss?
- Write-allocate (load into cache, update line in cache)
- Good if more writes to the location follow
- No-write-allocate (writes straight to memory, does not load into cache)
- Typical
- Write-through + No-write-allocate
- Write-back + Write-allocate
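A hedged sketch of the write-back + write-allocate combination for a single cache line (all names are made up; `fetch_block`/`writeback_block` stand in for the next memory level):

```c
#include <stdint.h>

#define B 64   /* block size in bytes (assumed) */

typedef struct {
    int      valid, dirty;   /* dirty: cached copy differs from memory */
    uint64_t tag;
    uint8_t  data[B];
} line_t;

/* Hypothetical helpers for talking to the next level of the hierarchy. */
void fetch_block(uint64_t tag, uint8_t *dst);
void writeback_block(uint64_t tag, const uint8_t *src);

void cache_write_byte(line_t *line, uint64_t tag, unsigned offset, uint8_t byte) {
    if (!(line->valid && line->tag == tag)) {        /* write miss */
        if (line->valid && line->dirty)              /* victim was modified */
            writeback_block(line->tag, line->data);  /* write back on eviction */
        fetch_block(tag, line->data);                /* write-allocate: load block */
        line->valid = 1;
        line->tag   = tag;
        line->dirty = 0;
    }
    line->data[offset] = byte;                       /* write hit: update cache copy only */
    line->dirty = 1;                                 /* defer the write to memory */
}
```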
Intel Core i7 Cache Hierarchy
- L1 i-cache and d-cache:
- 32 KB, 8-way
- Access: 4 cycles
- L2 unified cache:
- 256 KB, 8-way
- Access: 10 cycles
- L3 unified cache:
- 8 MB, 16-way
- Access: 40-75 cycles
- Block size: 64 bytes for all caches
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251114044631.png)
- i-cache: instructions
- d-cache: data
- Block size, E, and the number of sets are all you need to determine the capacity of the cache (see the worked example after this list).
- Word size does not actually matter because the fundamental unit of a cache is blocks.
- Dealing with words is something we do below the level of the cache
- Word is the fundamental atomic unit that the processor deals with internally -> think of bytes as fractions of words
- Block size is the fundamental unit for memory and caches
- Memory comes in pages, those are the main unit when we talk about virtual memory
- Sectors is a unit for disks
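Worked example using the i7 L1 parameters above (straightforward arithmetic, not stated on the slides):
- S = C / (E * B) = 32 KB / (8 * 64 B) = 32768 / 512 = 64 sets -> s = 6 set-index bits
- B = 64 bytes -> b = 6 block-offset bits
- The remaining high-order address bits form the tag.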
Cache Performance Metrics
- Miss Rate
- Fraction of memory references not found in cache (misses / accesses) = 1 - hit rate
- Typical numbers (in percentages):
- 3-10% for L1
- can be quite small (e.g., < 1%) for L2, depending on size, etc.
- Hit Time
- Time to deliver a line in the cache to the processor
- Includes time to determine whether the line is in the cache
- Typical numbers:
- 4 clock cycles for L1
- 10 clock cycles for L2
- Miss Penalty
- Additional time required because of a miss
- typically 50-200 cycles for main memory (Trend: increasing!)
Notes:
- Miss rate can be small for large caches
- A miss can be much more costly than a hit
- Every hit still costs the hit time (the roundtrip to that cache level)
- On a miss we have to go to the next level of the hierarchy, where the hit time is greater; it takes longer, and gets worse and worse as you go down the hierarchy.
- The trend in recent years is for the miss penalty to be increasing.
Let's think about those numbers
- Huge difference between a hit and a miss
- Could be 100x, if just L1 and main memory
- Would you believe 99% hits is twice as good as 97%?
- Consider:
- cache hit time of 1 cycle
- miss penalty of 100 cycles
- Average Access Time:
- AMAT = Hit time + Miss rate × Miss penalty
- Example:
- 97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
- The expected value of the average access time is given by this equation
- We hit the cache in 1 cycle
- The miss takes 100 cycles
- The average access time is 1 cycle + a 3% chance of the 100-cycle miss penalty
- 99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
- This is why "miss rate" is used instead of "hit rate"
Notes:
- Examples in code:
- Generally, minimizing misses matters most in the inner loops
- Try to access memory sequentially -> remember spatial locality
- If we access the very next word, it is already in the cache, and so are the bytes after it, and so on (things we know are in the cache because of spatial locality)
Writing Cache Friendly Code
- Make the common case go fast
- Focus on the inner loops of the core functions
- Minimize the misses in the inner loops
- Repeated references to variables are good (temporal locality)
- Stride-1 reference patterns are good (spatial locality)
- Key idea: Our qualitative notion of locality is quantified through our understanding of cache memories
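A small made-up C example of both points: the stride-1 walk gives spatial locality, and accumulating into a local variable (which the compiler can keep in a register) gives temporal locality:

```c
/* Cache-friendly summation: stride-1 accesses + a register-resident accumulator */
long sum_array(const long *a, long n) {
    long sum = 0;                  /* repeated references to sum stay in a register */
    for (long i = 0; i < n; i++)   /* stride-1 pattern: good spatial locality */
        sum += a[i];
    return sum;
}
```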
Performance impact of caches
The Memory Mountain
- Read throughput (read bandwidth)
- Number of bytes read from memory per second (MB/s)
- Memory mountain: Measured throughput as a function of spatial and temporal locality
- Compact way to characterize memory system performance.
Memory Mountain Test Function
mountain/mountain.c
long data[MAXELEMS]; /* Global array to traverse */
/* test - Iterate over first "elems" elements of
 * array "data" with stride of "stride",
 * using 4x4 loop unrolling.
 */
int test(int elems, int stride) {
long i, sx2=stride*2, sx3=stride*3, sx4=stride*4;
long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
long length = elems, limit = length - sx4;
/* Combine 4 elements at a time */
for (i = 0; i < limit; i += sx4) {
acc0 = acc0 + data[i];
acc1 = acc1 + data[i+stride];
acc2 = acc2 + data[i+sx2];
acc3 = acc3 + data[i+sx3];
}
/* Finish any remaining elements */
for (; i < length; i++) {
acc0 = acc0 + data[i];
}
return ((acc0 + acc1) + (acc2 + acc3));
}
- Call `test()` with many combinations of `elems` and `stride`
- For each `elems` and `stride`:
- Call `test()` once to warm up the caches
- Call `test()` again and measure the read throughput (MB/s)
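A hedged sketch of what such a measurement driver could look like; the real mountain.c uses the book's own cycle timers, so the POSIX `clock_gettime` call and the MB/s arithmetic here are only stand-ins:

```c
#include <time.h>

extern long data[];                 /* from mountain.c above */
int test(int elems, int stride);

/* Return approximate read throughput in MB/s for one (elems, stride) pair. */
double run(int elems, int stride) {
    struct timespec t0, t1;

    test(elems, stride);                        /* warm up the caches */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    test(elems, stride);                        /* measured run */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = (double)elems / stride * sizeof(long);  /* approx. elements read */
    return bytes / (1024.0 * 1024.0) / secs;    /* MB/s */
}
```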
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251114050215.png)
Rearranging loops to improve spatial locality
Matrix Multiplication Example
matmult/mm.c
/* ijk */
for (i=0; i<n; i++) {
for (j=0; j<n; j++) {
sum = 0.0; // variable `sum` held in register
for (k=0; k<n; k++)
sum += a[i][k] * b[k][j];
c[i][j] = sum;
}
}
- Canonical example of well vs. poorly written code
- Matrix multiplication is very common in the world
- Description:
- Multiply NxN matrices
- Matrix elements are doubles (8 bytes)
- O(N^3) total operations
- N reads per source element
- N values summed per destination
- but may be able to hold in register
Notes:
- You have to take the dot product of each row of one matrix with each column of the other matrix.
- Dot product: take the pairwise product of elements
- Any faster algorithms for Matrix multiplication?
- Strassen's algorithm: rearranges things to save a few multiplications
- It is asymptotically better than O(N^3)
- We are accessing `a` with good spatial locality -> going one by one through each row of `a` (accessing `a` sequentially)
- Poor spatial locality is each time we change `k`, since we are going to a different row of the matrix (the accesses to `b` walk down a column).
Miss Rate Analysis for Matrix Multiply
- Assume:
- Block size = 32 B (big enough for four doubles)
- Matrix dimension (N) is very large
- Approximate 1/N as 0.0
- Cache is not even big enough to hold multiple rows
- Analysis Method:
- Look at access pattern on inner loop
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251114050846.png)
Layout of C Arrays in Memory (review)
- C arrays allocated in row-major order
- each row in contiguous memory locations
- Stepping through columns in one row:
- `for (i = 0; i < N; i++) sum += a[0][i];`
- accesses successive elements = good spatial locality
- if block size (B) > sizeof(aij) bytes, exploit spatial locality
- miss rate = sizeof(aij) / B
- Example:
- If a block holds 4 doubles (32 bytes), then:
- One miss brings in 4 future elements.
- Miss rate ≈ sizeof(double)/block_size = 8/32 = 1/4.
- If a block holds 4 doubles (32 bytes), then:
- Stepping through rows in one column:
- `for (i = 0; i < n; i++) sum += a[i][0];`
- accesses distant elements
- no spatial locality!
- miss rate = 1 (i.e. 100%)
Check slides for "Matrix Multiplication (...)"
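For comparison (one of the orderings the slides cover), here is a sketch of the kij rearrangement; `r` is an extra scalar temporary, everything else is as in the ijk code above. The inner loop now walks rows of both `b` and `c` with stride 1:

```c
/* kij ordering: inner loop scans b[k][*] and c[i][*] row-wise (stride 1) */
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];               /* one element of a, held in a register */
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}
```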
Core i7 Matrix Multiply Performance
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251114051608.png)
- We tested several different formulations of the matrix multiplication algorithm; by rearranging the loop order we can reduce the cycles per inner-loop iteration by a good amount.
Using blocking to improve temporal locality
Example: Matrix Multiplication
c = (double *) calloc(sizeof(double), n*n);
/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n) {
int i, j, k;
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
for (k = 0; k < n; k++)
c[i*n + j] += a[i*n + k] * b[k*n + j];
}
- Blocking lets you focus on little areas (tiles) that all have good locality (see the sketch after these notes)
- What is the best blocking factor on your hardware? This is what we want to answer.
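A sketch of the blocked (tiled) version, under the assumptions that the tile size `BSIZE` divides n evenly and that c starts zeroed (as with the calloc above); the names are illustrative:

```c
#define BSIZE 32   /* tile size: tune so three BSIZE x BSIZE tiles fit in cache */

/* Blocked n x n multiply: reuse each tile of a, b, and c many times
 * while it is still resident in the cache (temporal locality). */
void mmm_blocked(double *a, double *b, double *c, int n) {
    for (int i1 = 0; i1 < n; i1 += BSIZE)
        for (int j1 = 0; j1 < n; j1 += BSIZE)
            for (int k1 = 0; k1 < n; k1 += BSIZE)
                /* multiply one pair of tiles into the c tile */
                for (int i = i1; i < i1 + BSIZE; i++)
                    for (int j = j1; j < j1 + BSIZE; j++) {
                        double sum = c[i*n + j];
                        for (int k = k1; k < k1 + BSIZE; k++)
                            sum += a[i*n + k] * b[k*n + j];
                        c[i*n + j] = sum;
                    }
}
```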
Cache Miss Analysis
- Assume:
- Matrix elements are doubles
- Cache block = 8 doubles
- Cache size C << n (much smaller than n)
- First iteration:
- n/8 + n = 9n/8 misses
- Afterwards in cache: (schematic)
- Second iteration:
- Again: n/8 + n = 9n/8 misses
- Total misses:
- 9n/8 * n² = (9/8) * n³
Remember: Example about prime numbers.
Blocking Summary
...
Cache Summary
- Cache memories can have significant performance impact
- You can write your programs to exploit this!
- Focus on the inner loops, where bulk of computations and memory accesses occur.
- Try to maximize spatial locality by reading data objects sequentially with stride 1.
- Try to maximize temporal locality by using a data object as often as possible once it's read from memory.