Program Optimization
Class: CSCE-312
Notes:
Today
- Overview
- Generally Useful Optimizations
- Code motion/precomputation
- Strength reduction
- Sharing of common subexpressions
- Removing unnecessary procedure calls
- Optimization Blockers
- Procedure calls
- Memory aliasing
- Exploiting Instruction-Level Parallelism
- Dealing with Conditionals
Overview
Performance Realities
- There’s more to performance than asymptotic complexity
- Constant factors matter too!
- Easily see 10:1 performance range depending on how code is written
- Must optimize at multiple levels:
- algorithm, data representations, procedures, and loops
- Must understand system to optimize performance
- How programs are compiled and executed
- How modern processors + memory systems operate
- How to measure program performance and identify bottlenecks
- How to improve performance without destroying code modularity and generality
Notes:
- You have to optimize code at multiple levels:
- 1st select a good algorithm with good asymptotic complexity
- Data representations
- You can do merge sort with lists, arrays, vectors, etc.
- How do we lay out our structs?
- It can have an impact on the size of the struct
- There are tools to profile your programs
- Identify how often a section of your program is executed
- Identify which part of the program takes the most time, and start optimizing there.
- You want your code to be modular and general but at the same time you want performance
Optimizing Compilers
- Provide efficient mapping of program to machine
- register allocation
- code selection and ordering (scheduling)
- which instructions to select, and in what order to put them
- There are degrees of freedom for the compiler to reorder things when many instructions do not depend on each other and the compiler realizes that.
- scheduling = making a really good plan for executing your code.
- dead code elimination
- Sometimes there is code that will never be reached; it was generated by the compiler to carry out some requirement of the program, but since it can never execute, the compiler realizes this and eliminates it.
- eliminating minor inefficiencies
- Jump target at the beginning of a cache block vs. in the middle of a cache block
- Try to align to large power of 2 boundaries.
- Don’t (usually) improve asymptotic efficiency
- up to programmer to select best overall algorithm
- big-O savings are (often) more important than constant factors
- but constant factors also matter
- Have difficulty overcoming “optimization blockers”
- potential memory aliasing
- potential procedure side-effects
- "Things that the compiler would like to optimize, but is prevented from doing because it does not have context on what you really wanted from your code" → source languages are not expressive enough.
- That is why we have more expressive interfaces, like C++ language features.
Notes:
- Can the compiler change the asymptotic complexity of your code?
- Yes, it can; it is scary how smart the compiler sometimes is.
- GCC these days can take code, recognize the pattern, and replace it with a closed-form mathematical expression that makes the code faster. (A sketch follows below.)
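A hedged sketch of this behavior (the function is illustrative; whether the rewrite actually happens depends on the GCC version and flags):

```c
/* At -O2, GCC's "final value replacement" pass can recognize this loop
   and compute the result in closed form, roughly n*(n-1)/2 -- turning
   an O(n) loop into O(1) code. */
long sum_to(long n) {
    long s = 0;
    for (long i = 0; i < n; i++)
        s += i;
    return s;
}
```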
Limitations of Optimizing Compiler
- Operate under fundamental constraint
- Must not cause any change in program behavior
- Except, possibly, when the program makes use of nonstandard language features
- Often prevents it from making optimizations that would only affect behavior under pathological conditions.
- Example: "What if one of the arrays overlaps another one? Then the optimization won't be valid."
- Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
- e.g., Data ranges may be more limited than variable types suggest
- Example: the compiler could put values into lookup tables if it knew their range.
- Most analysis is performed only within procedures
- Whole-program analysis is too expensive in most cases
- Newer versions of GCC do interprocedural analysis within individual files
- But, not between code in different files
- Within a procedure you can rearrange things and optimize, but across two procedures this is much more complex; it requires more analysis.
- Example: what if one procedure calls another procedure, which calls another procedure in a file that we do not have access to?
- Most analysis is based only on static information
- Compiler has difficulty anticipating run-time inputs
- The compiler doesn't know what values variables are going to take at run time.
- Example: In switch instruction, if the compiler knew which case was the most frequent, it could make something special to make this code faster.
- When in doubt, the compiler must be conservative
- The compiler can't violate any assumptions or make optimizations that may not be safe.
- The optimizer must not break your code.
Generally Useful Optimization
Useful Optimization
- Optimizations that you or the compiler should do regardless of processor / compiler
Code Motion
- Reduce frequency with which computation performed
- If it will always produce same result
- Especially moving code out of loop
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251030132005.png)
- `n*i` brings us to the *i*-th row of the matrix, and `n*i` is redundant; it is the same value every time through the loop.
- It is a no-brainer to move this redundant computation out of the loop; this will make the loop faster! (See the sketch after this list.)
- Now, instead of performing a costly multiplication on every iteration, we are just using a value already in a register.
- The compiler would do this very thing.
- How would you know, as the programmer, whether the compiler could do that? Should you leave this optimization to the compiler?
- Solution: look at the assembly code! That is where you can see the code optimizations made by the compiler and compare them with your own.
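A minimal sketch of the transformation the image illustrates (the function name `set_row` and its signature are assumptions, not from the notes):

```c
/* Before: n*i is recomputed on every iteration */
void set_row(double *a, double *b, long i, long n) {
    long j;
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];
}

/* After code motion: the loop-invariant n*i is computed once */
void set_row_opt(double *a, double *b, long i, long n) {
    long j;
    long ni = n * i;   /* hoisted out of the loop, kept in a register */
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}
```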
Compiler-Generated Code Motion (-O1)
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251030132305.png)
- Computes address in a row and increments it
- Instead of doing indexing, all we have to do is increment a pointer which is faster and more efficient.
- Note: since we are using doubles, you can see the lovely `%xmm0` register.
- This is probably about the most efficient code to implement this functionality.
Reduction in Strength
- Replace costly operation with simpler one
- Shift, add instead of multiply or divide
- `16*x` → `x << 4`
- Shifting is a much cheaper operation than multiplication
- Utility machine dependent
- Depends on cost of multiply or divide instruction
- On Intel Nehalem, integer multiply requires 3 CPU cycles
- Recognize sequence of products
Before optimization:
```c
for (i = 0; i < n; i++) {
    int ni = n*i;
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}
```
- Copying a vector to different rows of a matrix
After optimization:
```c
int ni = 0;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
    ni += n;   /* strength reduction: add n instead of computing n*i */
}
```
- Instead of multiplying we repeatedly sum
- Replacing multiplication with addition
- Multiplication = 3 cycles
- Addition = 1 cycle
Share Common Subexpressions
- Reuse portions of expressions
- GCC will do this with –O1
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251030133110.png)
- Top-left:
- Notice each one of these indices is a complicated expression
- But notice there are some common subexpressions
- The compiler can figure out how to do some algebra to factor out common subexpressions, then rewrite the expressions in terms of them, so that instead of many multiplications you have just one. (See the sketch below.)
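A hedged sketch of the kind of code this slide usually shows (the neighbor-sum setting is assumed from the image, not stated in the notes):

```c
/* Before: each index expression performs its own multiplication by n */
double neighbor_sum(double *val, long i, long j, long n) {
    double up    = val[(i-1)*n + j];
    double down  = val[(i+1)*n + j];
    double left  = val[i*n + j-1];
    double right = val[i*n + j+1];
    return up + down + left + right;
}

/* After: the common subexpression i*n + j is computed once */
double neighbor_sum_cse(double *val, long i, long j, long n) {
    long inj = i*n + j;
    double up    = val[inj - n];
    double down  = val[inj + n];
    double left  = val[inj - 1];
    double right = val[inj + 1];
    return up + down + left + right;
}
```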
Optimizing Blockers
Optimization Blocker #1: Procedure Calls
Procedure to Convert String to Lower Case
```c
void lower(char *s)
{
    size_t i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```
- `strlen(s)` computes the length of the string (char array)
- What is the asymptotic complexity of this function?
- `strlen(s)` takes O(n) time and you are calling it for each element in `s`, so this function is O(n²)
- The compiler doesn't really know what `strlen(s)` does, so you have to make this optimization yourself.
- It might know that it is not modifying the string, but other than that it does not know what it does.
Lower Case Conversion Performance
- Time quadruples when string length doubles
- Quadratic performance
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251030133952.png)
- It shouldn't take that long to process half a million characters!
Improving Performance
```c
void lower(char *s)
{
    size_t i;
    size_t len = strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```
- Move call to `strlen()` outside of loop
- Since result does not change from one iteration to another
- This is a form of code motion
Lower Case Conversion Performance
- Time doubles when string length doubles
- Linear performance of lower2
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251030134125.png)
Optimization Blocker: Procedure Calls
- Why couldn't the compiler move `strlen` out of the inner loop?
- Procedure may have side effects
- Alters global state each time called
- Function may not return same value for given arguments
- Depends on other parts of global state
- Procedure `lower` could interact with `strlen`
- Warning:
- Compiler treats procedure call as a black box
- Weak optimizations near them
- Remedies:
- Use of inline functions
- GCC does this with –O1, within a single file
- Do your own code motion
Notes:
- Function inlining:
- If a function is sufficiently small, the compiler saves itself the overhead of calling it and literally just replaces the call instruction with the actual code of the function
- We cannot do this with very big functions because the size of the program would blow up.
- Inlining allows the compiler to do better inter-procedural analysis.
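The `strlen` variant below illustrates why procedure calls block optimization: it has a side effect, updating the global counter `lencnt`, so the compiler cannot treat repeated calls as redundant.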
```c
size_t lencnt = 0;            /* global state */

size_t strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++; length++;
    }
    lencnt += length;         /* side effect: alters global state */
    return length;
}
```
Memory Matters
C code:
```c
/* Sum rows of n x n matrix a
   and store in vector b */
void sum_rows1(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}
```
Assembly code:
```asm
# sum_rows1 inner loop
.L4:
    movsd (%rsi,%rax,8), %xmm0   # FP load
    addsd (%rdi), %xmm0          # FP add
    movsd %xmm0, (%rsi,%rax,8)   # FP store
    addq $8, %rdi
    cmpq %rcx, %rdi
    jne .L4
```
- Code updates `b[i]` on every iteration
- Why couldn't the compiler optimize this away?
- Vector `b` might overlap matrix `a`
- If we modify `b`, it will be reflected in a later iteration of the loop
- The compiler has to assume that any pointer could be another name (alias) for the same memory
- The compiler just can't do that kind of code motion; you have to do it.
- There are "annotations" to tell the compiler what you want in Fortran. (C offers something similar; see the sketch below.)
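A hedged aside, not from the lecture: in C99, the `restrict` qualifier is such an annotation. It promises the compiler that the pointers do not alias, which re-enables the optimization:

```c
/* restrict asserts that, within this function, a and b are the only
   ways to reach the memory they point to, i.e., they do not alias.
   If the caller breaks that promise, behavior is undefined. */
void sum_rows1r(double *restrict a, double *restrict b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];   /* compiler may now keep b[i] in a register */
    }
}
```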
Memory Aliasing
```c
/* Sum rows of n x n matrix a
   and store in vector b */
void sum_rows1(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}
```
```c
double A[9] =
    {  0,  1,   2,
       4,  8,  16,
      32, 64, 128};

double *B = A + 3;   /* B aliases the second row of A */

sum_rows1(A, B, 3);
```
- No one would write code this way, but the compiler doesn't know that!
- Code updates `b[i]` on every iteration
- Must consider the possibility that these updates will affect program behavior
Removing Aliasing
C code:
```c
/* Sum rows of n x n matrix a
   and store in vector b */
void sum_rows2(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;
        for (j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;
    }
}
```
Assembly code:
```asm
# sum_rows2 inner loop
.L10:
    addsd (%rdi), %xmm0   # FP load + add
    addq $8, %rdi
    cmpq %rax, %rdi
    jne .L10
```
- No need to store intermediate results
- Now within this loop there is only one array access instead of two.
Optimization Blocker: Memory Aliasing
- Aliasing
- Two different memory references specify single location
- Easy to have happen in C
- Since allowed to do address arithmetic
- Direct access to storage structures
- Get in habit of introducing local variables
- Accumulating within loops
- Your way of telling compiler not to check for aliasing
Exploiting Instruction-Level Parallelism
Exploiting Instruction-Level Parallelism
- Need general understanding of modern processor design
- Hardware can execute multiple instructions in parallel
- Performance limited by data dependencies
- Simple transformations can yield dramatic performance improvement
- Compilers often cannot make these transformations
- Lack of associativity and distributivity in floating-point arithmetic
Notes:
- Associativity and distributivity do not actually hold for floating-point arithmetic, so the compiler cannot apply these transformations itself; you have to optimize them yourself.
Benchmark Example: Data Type for Vectors
```c
/* data structure for vectors */
typedef struct {
    size_t len;
    data_t *data;
} vec;
```
- The type `data_t` is a type alias — you can redefine it to test different numeric types (`int`, `float`, `double`, etc.) to see how data type affects performance.
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251103230704.png)
```c
/* retrieve vector element
   and store at val */
int get_vec_element(vec *v, size_t idx, data_t *val) {
    if (idx >= v->len)
        return 0;            /* out of bounds */
    *val = v->data[idx];     /* read the value into *val */
    return 1;
}
```
- It looks simple, but notice how it adds function call overhead and bounds checks for each element access. When used in a tight loop, these extra steps can heavily affect performance.
Data Types
- Use different declarations for `data_t`:
- `int`
- `long`
- `float`
- `double`
Benchmark Computation
Compute sum or product of vector elements
```c
void combine1(vec_ptr v, data_t *dest)
{
    long int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}
```
How it works:
- This function loops over all vector elements and combines them into one scalar.
- `OP` and `IDENT` are macros that change depending on the operation:
- If you're summing → `OP` = `+` and `IDENT` = `0`
- If you're multiplying → `OP` = `*` and `IDENT` = `1` (a configuration sketch follows below)
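A minimal sketch of how this is typically configured (the exact definitions are an assumption; they are not shown in the notes):

```c
typedef long data_t;   /* switch to int, float, or double to re-benchmark */

#define OP    +        /* use *  to compute the product instead */
#define IDENT 0        /* use 1  as the identity for the product */
```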
Why do we benchmark it?
- It's a controlled, repeatable workload that stresses:
- Memory access patterns (`v->data[i]`)
- Function call overhead (`get_vec_element`)
- Arithmetic throughput (`OP`)
- We can time it for different `n` (vector lengths) and data types to measure how efficiently the CPU handles the operation per element.
Cycles Per Element (CPE)
- Convenient way to express performance of program that operates on vectors or lists
- Length = `n`
- In our case: CPE = cycles per OP
- `T = CPE*n + Overhead`
- `T` = total cycles (execution time × CPU frequency)
- `CPE` = slope of the line
- `n` = number of elements
- `Overhead` = fixed cost (function setup, loop entry/exit, etc.)
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251030135724.png)
- The function might have had 9 cycles per element, but after optimizations (like inlining or better memory access), it goes down to 6 cycles per element.
- Note: A smaller slope means better performance
Benchmark Performance
```c
void combine1(vec_ptr v, data_t *dest)
{
    long int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}
```
- Compute sum or product of vector elements
| Method | Integer |  | Double FP |  |
|---|---|---|---|---|
| Operation | Add | Mult | Add | Mult |
| Combine1 unoptimized | 22.68 | 20.02 | 19.98 | 20.18 |
| Combine1 –O1 | 10.12 | 10.12 | 10.17 | 11.14 |
Basic Optimizations
```c
void combine4(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    data_t *d = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP d[i];
    *dest = t;
}
```
- Move `vec_length` out of loop
- Avoid bounds check on each iteration
- Accumulate in temporary
| Method | Integer |  | Double FP |  |
|---|---|---|---|---|
| Operation | Add | Mult | Add | Mult |
| Combine1 –O1 | 10.12 | 10.12 | 10.17 | 11.14 |
| Combine4 | 1.27 | 3.01 | 3.01 | 5.01 |
- Eliminates sources of overhead in loop
Not on Exam 3.
Modern CPU Design (Ex4)
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251106125750.png)
- We can make these things out of Digital Logic Design.
- This is how a computer is organized
- Separate into two main things
- Instruction control
- Fetch: retrieves the next instructions to execute
- Uses IP (Instruction Pointer) to get instructions from the instruction cache
- Keeps instructions in the instruction cache
- Instructions go from the cache to Instruction Decode → figures out what the instructions are supposed to do.
- There is a branch predictor → predicts which instruction will be next → a signal goes back to the instruction control unit for confirmation.
- Receives a signal indicating whether the branch was predicted correctly, and whether the target was correct (predicting where it goes)
- The branch predictor has to predict both the branch outcome and the target (where it goes).
- Branch targets are cached; the branch predictor may just use the cached target and assume it is correct, but it may actually be incorrect.
- Execution
- Data cache: stores recently loaded data
- Functional units -> very complex
- Array of registers is accessed by Control and Execution branches.
Superscalar Processor
- Definition: A superscalar processor can issue and execute multiple instructions in one cycle. The instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically.
- To you (looking at the assembly language) it looks like they execute one at a time, but the processor figures out how to execute many at a time, enabling parallelism.
- If one in the middle of the process is a jump instruction, then the branch predictor comes in.
- We stop the instructions beyond the jump, start fetching instructions from the predicted path, and evaluate with the branch predictor
- We want a branch to behave predictably as often as possible (e.g., if the branch condition is usually true, we want this branch to typically be taken)
- Fetch block: the instructions that are fetched in one cycle → arrange it so that if there is a taken branch, it is the last instruction in that fetch block.
- Benefit: without programming effort, superscalar processor can take advantage of the instruction level parallelism that most programs have.
- The independence between instructions that most programs have
- Most modern CPUs are superscalar.
- Intel Atom was a processor that was single-issue; that is the only example given of a non-superscalar CPU
- Intel: since Pentium (1993)
Pipelined Functional Units
```c
long mult_eg(long a, long b, long c) {
    long p1 = a*b;
    long p2 = a*c;
    long p3 = p1 * p2;
    return p3;
}
```
- `a * b` and `a * c` are independent; they do not depend on each other → you can compute them in parallel.
- But `p1 * p2` depends on both of them → it must wait until both finish.
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Stage 1 | `a*b` | `a*c` |  |  | `p1*p2` |  |  |
| Stage 2 |  | `a*b` | `a*c` |  |  | `p1*p2` |  |
| Stage 3 |  |  | `a*b` | `a*c` |  |  | `p1*p2` |

- `p1*p2` cannot enter Stage 1 until cycle 5, once both `p1` (done in cycle 3) and `p2` (done in cycle 4) are available
- Top row is "Time"
The computation proceeds in a pipelined way, like an assembly line.
- Divide computation into stages
- Pass partial computations from stage to stage
- Stage i can start on new computation once values passed to i+1
- E.g., complete 3 multiplications in 7 cycles, even though each requires 3 cycles
- A multiplication takes 3 cycles, but we can begin a new multiplication each cycle, so the three multiplications largely overlap instead of requiring 3 separate cycles each; this is because of pipelining.
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251106132046.png)
Haswell CPU
- 8 Total Functional Units
Multiple instructions can execute in parallel
- 2 load, with address computation
- 1 store, with address computation
- 4 integer
- 2 FP multiply
- 1 FP add
- 1 FP divide
Some instructions take > 1 cycle, but can be pipelined
| Instruction | Latency | Cycles/Issue |
|---|---|---|
| Load / Store | 4 | 1 |
| Integer Multiply | 3 | 1 |
| Integer/Long Divide | 3-30 | 3-30 |
| Single/Double FP Multiply | 5 | 1 |
| Single/Double FP Add | 3 | 1 |
| Single/Double FP Divide | 3-15 | 3-15 |
- Note: `leaq` (which computes an address) is completely separate; it is not executed this way
- Overlapping the execution to exploit some parallelism
Notes:
- Divide is very costly; sometimes we implement it as a multiplication instead
x86-64 Compilation of Combine4
- Inner Loop (Case: Integer Multiply)
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251106132756.png)
- Latency = the time an instruction takes.
- How long does each instruction take -> then add that up
- This is not right, multiply is pipelined.
- Some instructions can go completely in parallel, and some rely on a previous instruction.
- It doesn't really take 3 cycles to do the multiply -> it is faster because of parallelism
Combine4 = Serial Computation (OP = *)
- Computation (length = 8):
- `x = ((((((((1 * d[0]) * d[1]) * d[2]) * d[3]) * d[4]) * d[5]) * d[6]) * d[7])`
- Sequential dependence
- Performance: determined by latency of OP
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251123225249.png)
```c
x0 = IDENT;
x1 = x0 * d[0];
x2 = x1 * d[1];
x3 = x2 * d[2];
...
```
- Each multiply depends on the result of the previous multiplication.
- Even if multipliers are pipelined, dependency breaks parallelism.
- Latency of a (pipelined) multiply = ~3 cycles
- Throughput = 1 multiply per cycle
- BUT because each multiply depends on the previous one, we still must respect its latency.
Loop unrolling (2x1)
- Loop unrolling does two main things:
- Reduces branches & overhead
- Good for small operations like integer addition.
- Gives compiler more scheduling space
- Instructions can be rearranged to use parallel functional units.
- Reduces branches & overhead
- But loop unrolling does NOT automatically eliminate dependency chains.
```c
void unroll2a_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    long i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x = (x OP d[i]) OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x = x OP d[i];
    }
    *dest = x;
}
```
- Perform 2x more useful work per iteration
- We can replace the code for that loop -> this is called loop unrolling
- Why would you do that? -> to avoid a branch
- The actual instructions that result from that can be moved and scheduled to exploit instruction-level parallelism
Effect of Loop Unrolling
| Method | Integer |  | Double FP |  |
|---|---|---|---|---|
| Operation | Add | Mult | Add | Mult |
| Combine4 | 1.27 | 3.01 | 3.01 | 5.01 |
| Unroll 2x1 | 1.01 | 3.01 | 3.01 | 5.01 |
| Latency Bound | 1.00 | 3.00 | 3.00 | 5.00 |
- Helps integer add
- Achieves latency bound
- Others don’t improve. Why?
- Still sequential dependency:
- `x = (x OP d[i]) OP d[i+1];`
- This still expands to one chain:
- `x = (((...((x * d[i]) * d[i+1])) ...)`
- Still dependent on `x`
- Even though two elements are processed per loop iteration:
- `(x OP d[i])` depends on `x`
- `((x OP d[i]) OP d[i+1])` depends on both
- Next iteration depends on `x` again
Loop Unrolling with Reassociation (2x1a)
```c
void unroll2aa_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    long i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {   // Compare to before:
        x = x OP (d[i] OP d[i+1]);   // x = (x OP d[i]) OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x = x OP d[i];
    }
    *dest = x;
}
```
- Can this change the result of the computation?
- Yes, for FP. Why?
What changed?
- Before:
- Step A: `t1 = x OP d[i]` (depends on `x`)
- Step B: `x = t1 OP d[i+1]` (depends on `t1`)
- After:
- Step A: `t = d[i] OP d[i+1]` (independent of `x`)
- Step B: `x = x OP t` (depends on previous `x`)
Why this breaks part of the dependency chain
- `d[i] OP d[i+1]` can start early, before `x` is ready
- The next iteration can start computing `(d[i+2] OP d[i+3])` in parallel
- The `x` update still depends on the old `x`, but now half the work is off the dependency chain.
Why FP results change
- Floating-point is not associative:
- (a + b) + c ≠ a + (b + c)
- (a * b) * c ≠ a * (b * c)
- Reordering changes rounding.
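A quick sketch demonstrating the rounding effect (values chosen for illustration):

```c
#include <stdio.h>

int main(void) {
    float a = 1e20f, b = -1e20f, c = 1.0f;
    printf("%f\n", (a + b) + c);  /* 1.000000: a and b cancel exactly */
    printf("%f\n", a + (b + c));  /* 0.000000: c vanishes in rounding */
    return 0;
}
```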
Effect of Reassociation
| Method | Integer |  | Double FP |  |
|---|---|---|---|---|
| Operation | Add | Mult | Add | Mult |
| Combine4 | 1.27 | 3.01 | 3.01 | 5.01 |
| Unroll 2x1 | 1.01 | 3.01 | 3.01 | 5.01 |
| Unroll 2x1a | 1.01 | 1.51 | 1.51 | 2.51 |
| Latency Bound | 1.00 | 3.00 | 3.00 | 5.00 |
| Throughput Bound | 0.50 | 1.00 | 1.00 | 0.50 |
- 4 functional units for int add + 2 functional units for load
- 2 functional units for FP multiply + 2 functional units for load
- Nearly 2x speedup for `Int *`, `FP +`, `FP *`
- Why is that?
- Reason: breaks the sequential dependency `x = x OP (d[i] OP d[i+1]);`
- What changed:
- Ops in the next iteration can be started early (no dependency)
- Overall Performance
- N elements, D cycles latency/op
- (N/2+1) * D cycles: CPE = D/2
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251123230115.png)
Loop Unrolling with Separate Accumulators (2x2)
```c
void unroll2a_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x0 = IDENT;
    data_t x1 = IDENT;
    long i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x0 = x0 OP d[i];
        x1 = x1 OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 = x0 OP d[i];
    }
    *dest = x0 OP x1;
}
```
- Different form of reassociation
What does this do?
- This is the key idea:
- `x0` handles even elements
- `x1` handles odd elements
- Now we have two independent accumulator chains:
- `x0 = x0 * d[0] → * d[2] → * d[4] ...`
- `x1 = x1 * d[1] → * d[3] → * d[5] ...`
Why this is powerful
- x0 and x1 do not depend on each other
- CPU can run them in different functional units
- Two multiply pipelines → we get almost double throughput.
Effect of Separate Accumulators
| Method | Integer |  | Double FP |  |
|---|---|---|---|---|
| Operation | Add | Mult | Add | Mult |
| Combine4 | 1.27 | 3.01 | 3.01 | 5.01 |
| Unroll 2x1 | 1.01 | 3.01 | 3.01 | 5.01 |
| Unroll 2x1a | 1.01 | 1.51 | 1.51 | 2.51 |
| Unroll 2x2 | 0.81 | 1.51 | 1.51 | 2.51 |
| Latency Bound | 1.00 | 3.00 | 3.00 | 5.00 |
| Throughput Bound | 0.50 | 1.00 | 1.00 | 0.50 |
- Int + makes use of two load units
- `x0 = x0 OP d[i];`
- `x1 = x1 OP d[i+1];`
- 2x speedup (over unroll 2x1) for `Int *`, `FP +`, `FP *`
- What changed:
- Two independent "streams" of operations
- Overall Performance
- N elements, D cycles latency/op
- Should be (N/2+1) * D cycles: CPE = D/2
- CPE matches prediction!
- CPE = 3 / 2 = 1.5 cycles per element
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251123233856.png)
Unrolling & Accumulating
- Idea
- Can unroll to any degree L
- Can accumulate K results in parallel
- L must be a multiple of K (a sketch follows this list)
- Limitations
- Diminishing returns
- Too many accumulators give diminishing returns; you cannot go beyond the throughput limitations of the execution units
- Large overhead for short lengths
- Short arrays → the overhead is not worth it
- Finish off the remaining iterations sequentially
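A hedged sketch of one such variant, following the pattern of the 2x2 code above: unroll by L = 4 with K = 4 separate accumulators (the function name is assumed):

```c
void unroll4x4_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length - 3;
    data_t *d = get_vec_start(v);
    data_t x0 = IDENT, x1 = IDENT, x2 = IDENT, x3 = IDENT;
    long i;
    /* Combine 4 elements at a time, in 4 independent chains */
    for (i = 0; i < limit; i += 4) {
        x0 = x0 OP d[i];
        x1 = x1 OP d[i+1];
        x2 = x2 OP d[i+2];
        x3 = x3 OP d[i+3];
    }
    /* Finish any remaining elements sequentially */
    for (; i < length; i++)
        x0 = x0 OP d[i];
    *dest = (x0 OP x1) OP (x2 OP x3);
}
```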
What About Branches?
- Challenge
- Instruction Control Unit must work well ahead of Execution Unit to generate enough operations to keep EU busy
- When encounters conditional branch, cannot reliably determine where to continue fetching
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251123234746.png)
Branch Outcomes
- When encounter conditional branch, cannot determine where to continue fetching
- Branch Taken: Transfer control to branch target
- Branch Not-Taken: Continue with next instruction in sequence
- Cannot resolve until outcome determined by branch/integer unit
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251106133424.png)
Notes:
- With conditional branches we do not know whether we have to jump; those conditions depend on other instructions that are pipelined somewhere else.
- The only thing to do then is to wait until all the instructions this branch depends on have executed.
- From 5 to 300 cycles
- This waiting makes the pipelining worthless
Branch Prediction
- Idea
- Guess which way branch will go
- Begin executing instructions at predicted position
- But don’t actually modify register or memory data
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251106133628.png)
Notes:
- When we predict a branch, we begin speculation → computing values but not actually putting them in register files or storing them to memory where other instructions can see them; we just keep those results buffered.
- Then, once we confirm the branch was taken/not taken, we commit what we computed to registers/memory.
- If the prediction is incorrect, we toss out all the values we computed speculatively, and no one ever sees them.
- This is very costly, but we maintain correct program execution.
- "Educated guess"
- The more accurate the predictor can be, the faster our processor may be.
Branch Prediction Through Loop
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251106133951.png)
- Unrolling the loop inside the processor
- With branch prediction we keep fetching over and over, so we have multiple instances of the loop body already in flight, waiting to be verified
Branch Misprediction Invalidation
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251106134123.png)
Branch Misprediction Recovery
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251106134147.png)
- Performance Cost
- Multiple clock cycles on modern processor
- Can be a major performance limiter
- Cost: all the time we spent going down the wrong branch
- Reduce the risk by making branches easier to predict (see the sketch below)
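A hedged sketch, not from the slides: one way to avoid an unpredictable branch entirely is to write a selection as a conditional expression, which compilers typically turn into a conditional move (`cmov`) instead of a jump:

```c
/* A data-dependent if/else becomes an unpredictable branch when the
   condition varies randomly. Written as a conditional expression,
   gcc/clang typically emit a cmov: no branch, no misprediction. */
long min_val(long a, long b)
{
    return (a < b) ? a : b;
}
```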
Getting High Performance
- Good compiler and flags
- Don't do anything stupid
- Watch out for hidden algorithmic inefficiencies
- Write compiler-friendly code
- Watch out for optimization blockers: procedure calls & memory references
- Look carefully at innermost loops (where most work is done)
- Care about 2D array access: C stores arrays in row-major order, so accessing down a column (a different row each time) is slower than walking along a row (see the sketch after this list)
- Tune code for machine
- Exploit instruction-level parallelism
- Avoid unpredictable branches
- Try not to do data dependent conditional jumps
- Make code cache friendly (Covered later in course)
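A minimal sketch of the row-major point (array size and function names are illustrative):

```c
#define N 1024
double a[N][N];

/* Good: row-major order -- the inner loop walks consecutive memory */
double sum_by_rows(void) {
    double s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Bad: column order -- each access jumps N*sizeof(double) bytes */
double sum_by_cols(void) {
    double s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```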
Notes:
- Is there any case where you do not want any code optimizations?
- For debugging -> optimization makes it hard to tell what is actually going on
- Without optimization, the debugger can figure out exactly which line of code caused the problem.
- There are cases where optimization actually reveals bugs in the original code
- Maybe an alignment thing
- Doesn't it change some math operations?
- It really shouldn't
- The compiler must respect an order of operations consistent with what the programmer specified.
- Note: there exist compiler bugs where the compiler actually did something wrong, but these are extremely rare
- In 15 years of programming it happened 3 times to Professor Jimenez