Program Optimization
Class: CSCE-312
Notes:
Today
- Overview
- Generally Useful Optimizations
- Code motion/precomputation
- Strength reduction
- Sharing of common subexpressions
- Removing unnecessary procedure calls
- Optimization Blockers
- Procedure calls
- Memory aliasing
- Exploiting Instruction-Level Parallelism
- Dealing with Conditionals
Overview
Performance Realities
- There’s more to performance than asymptotic complexity
- Constant factors matter too!
- Easily see 10:1 performance range depending on how code is written
- Must optimize at multiple levels:
- algorithm, data representations, procedures, and loops
- Must understand system to optimize performance
- How programs are compiled and executed
- How modern processors + memory systems operate
- How to measure program performance and identify bottlenecks
- How to improve performance without destroying code modularity and generality
Notes:
- You have to optimize code at multiple levels:
- 1st select a good algorithm with good asymptotic complexity
- Data representations
- You can do merge sort with lists, arrays, vectors, etc.
- How do we lay out our structs?
- It can have an impact on the size of the struct
- There are tools to profile your programs
- Identify how often a section of your program is executed
- Identify which part of the program takes the most time, and start optimizing there.
- You want your code to be modular and general but at the same time you want performance
Optimizing Compilers
- Provide efficient mapping of program to machine
- register allocation
- code selection and ordering (scheduling)
- which instructions to select, and in what order to put them
- There are degrees of freedom for the compiler to reorder things when many instructions do not depend on each other and the compiler realizes that.
- scheduling = making a really good plan for executing your code.
- dead code elimination
- Sometimes there is code that will never be reached; it was generated by the compiler to carry out some requirement of the program, but since it can never execute, the compiler realizes this and eliminates it.
- eliminating minor inefficiencies
- Jump target at the beginning of a cache block vs. in the middle of a cache block
- Try to align to large power of 2 boundaries.
- Don’t (usually) improve asymptotic efficiency
- up to programmer to select best overall algorithm
- big-O savings are (often) more important than constant factors
- but constant factors also matter
- Have difficulty overcoming “optimization blockers”
- potential memory aliasing
- potential procedure side-effects
- "Things that the compiler would like to optimize, but is prevented from doing because it does not have context on what you really wanted from your code" → source languages are not expressive enough.
- That is why we have more expressive interfaces, like C++ language features.
Notes:
- Can the compiler change the asymptotic complexity of your code?
- Yes, it can; it is scary how smart the compiler sometimes is.
- GCC these days can take code, recognize the pattern, and replace it with a closed-form mathematical expression that makes the code faster. (A sketch follows below.)
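A hedged sketch of this behavior (the function is illustrative; whether the rewrite actually happens depends on the GCC version and flags):

```c
/* At -O2, GCC's "final value replacement" pass can recognize this loop
   and compute the result in closed form, roughly n*(n-1)/2 -- turning
   an O(n) loop into O(1) code. */
long sum_to(long n) {
    long s = 0;
    for (long i = 0; i < n; i++)
        s += i;
    return s;
}
```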
Limitations of Optimizing Compiler
- Operate under fundamental constraint
- Must not cause any change in program behavior
- Except, possibly, when the program makes use of nonstandard language features
- Often prevents it from making optimizations that would only affect behavior under pathological conditions.
- Example: "What if one of the arrays overlaps another one? Then the optimization won't be valid."
- Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
- e.g., Data ranges may be more limited than variable types suggest
- Example: the compiler could put values into lookup tables if it knew their range.
- Most analysis is performed only within procedures
- Whole-program analysis is too expensive in most cases
- Newer versions of GCC do interprocedural analysis within individual files
- But, not between code in different files
- Within a procedure you can rearrange things and optimize, but across two procedures this is much more complex; it requires more analysis.
- Example: what if one procedure calls another procedure, which calls another procedure in a file that we do not have access to?
- Most analysis is based only on static information
- Compiler has difficulty anticipating run-time inputs
- The compiler doesn't know what values variables are going to take at run time.
- Example: In switch instruction, if the compiler knew which case was the most frequent, it could make something special to make this code faster.
- When in doubt, the compiler must be conservative
- The compiler can't violate any assumptions or make optimizations that may not be safe.
- The optimizer must not break your code.
Generally Useful Optimization
Useful Optimization
- Optimizations that you or the compiler should do regardless of processor / compiler
Code Motion
- Reduce frequency with which computation performed
- If it will always produce same result
- Especially moving code out of loop
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251030132005.png)
- `n*i` brings us to the *i*-th row of the matrix, and `n*i` is redundant; it is the same value every time through the loop.
- It is a no-brainer to move this redundant computation out of the loop; this will make the loop faster! (See the sketch after this list.)
- Now, instead of performing a costly multiplication on every iteration, we are just using a value already in a register.
- The compiler would do this very thing.
- How would you know, as the programmer, whether the compiler could do that? Should you leave this optimization to the compiler?
- Solution: look at the assembly code! That is where you can see the code optimizations made by the compiler and compare them with your own.
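A minimal sketch of the transformation the image illustrates (the function name `set_row` and its signature are assumptions, not from the notes):

```c
/* Before: n*i is recomputed on every iteration */
void set_row(double *a, double *b, long i, long n) {
    long j;
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];
}

/* After code motion: the loop-invariant n*i is computed once */
void set_row_opt(double *a, double *b, long i, long n) {
    long j;
    long ni = n * i;   /* hoisted out of the loop, kept in a register */
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}
```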
Compiler-Generated Code Motion (-O1)
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251030132305.png)
- Computes address in a row and increments it
- Instead of doing indexing, all we have to do is increment a pointer which is faster and more efficient.
- Note: since we are using doubles, you can see the lovely `%xmm0` register.
- This is probably about the most efficient code to implement this functionality.
Reduction in Strength
- Replace costly operation with simpler one
- Shift, add instead of multiply or divide
- `16*x` → `x << 4`
- Shifting is a much cheaper operation than multiplication
- Utility machine dependent
- Depends on cost of multiply or divide instruction
- On Intel Nehalem, integer multiply requires 3 CPU cycles
- Recognize sequence of products
Before optimization:
```c
for (i = 0; i < n; i++) {
    int ni = n*i;
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}
```
- Copying a vector to different rows of a matrix
After optimization:
```c
int ni = 0;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
    ni += n;   /* strength reduction: add n instead of computing n*i */
}
```
- Instead of multiplying we repeatedly sum
- Replacing multiplication with addition
- Multiplication = 3 cycles
- Addition = 1 cycle
Share Common Subexpressions
- Reuse portions of expressions
- GCC will do this with –O1
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251030133110.png)
- Top-left:
- Notice each one of these indices is a complicated expression
- But notice there are some common subexpressions
- The compiler can figure out how to do some algebra to factor out common subexpressions, then rewrite the expressions in terms of them, so that instead of many multiplications you have just one. (See the sketch below.)
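A hedged sketch of the kind of code this slide usually shows (the neighbor-sum setting is assumed from the image, not stated in the notes):

```c
/* Before: each index expression performs its own multiplication by n */
double neighbor_sum(double *val, long i, long j, long n) {
    double up    = val[(i-1)*n + j];
    double down  = val[(i+1)*n + j];
    double left  = val[i*n + j-1];
    double right = val[i*n + j+1];
    return up + down + left + right;
}

/* After: the common subexpression i*n + j is computed once */
double neighbor_sum_cse(double *val, long i, long j, long n) {
    long inj = i*n + j;
    double up    = val[inj - n];
    double down  = val[inj + n];
    double left  = val[inj - 1];
    double right = val[inj + 1];
    return up + down + left + right;
}
```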
Optimizing Blockers
Optimization Blocker #1: Procedure Calls
Procedure to Convert String to Lower Case
```c
void lower(char *s)
{
    size_t i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```
- `strlen(s)` computes the length of the string (char array)
- What is the asymptotic complexity of this function?
- `strlen(s)` takes O(n) time and you are calling it for each element in `s`, so this function is O(n²)
- The compiler doesn't really know what `strlen(s)` does, so you have to make this optimization yourself.
- It might know that it is not modifying the string, but other than that it does not know what it does.
Lower Case Conversion Performance
- Time quadruples when string length doubles
- Quadratic performance
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251030133952.png)
- It shouldn't take that long to process half a million characters!
Improving Performance
```c
void lower(char *s)
{
    size_t i;
    size_t len = strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```
- Move call to `strlen()` outside of loop
- Since result does not change from one iteration to another
- This is a form of code motion
Lower Case Conversion Performance
- Time doubles when string length doubles
- Linear performance of lower2
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251030134125.png)
Optimization Blocker: Procedure Calls
- Why couldn't the compiler move `strlen` out of the inner loop?
- Procedure may have side effects
- Alters global state each time called
- Function may not return same value for given arguments
- Depends on other parts of global state
- Procedure `lower` could interact with `strlen`
- Warning:
- Compiler treats procedure call as a black box
- Weak optimizations near them
- Remedies:
- Use of inline functions
- GCC does this with –O1, within a single file
- Do your own code motion
Notes:
- Function inlining:
- If a function is sufficiently small, the compiler saves itself the overhead of calling it and literally just replaces the call instruction with the actual code of the function
- We cannot do this with very big functions because the size of the program would blow up.
- Inlining allows the compiler to do better inter-procedural analysis.
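The `strlen` variant below illustrates why procedure calls block optimization: it has a side effect, updating the global counter `lencnt`, so the compiler cannot treat repeated calls as redundant.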
```c
size_t lencnt = 0;            /* global state */

size_t strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++; length++;
    }
    lencnt += length;         /* side effect: alters global state */
    return length;
}
```
Memory Matters
C code:
```c
/* Sum rows of n x n matrix a
   and store in vector b */
void sum_rows1(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}
```
Assembly code:
```asm
# sum_rows1 inner loop
.L4:
    movsd (%rsi,%rax,8), %xmm0   # FP load
    addsd (%rdi), %xmm0          # FP add
    movsd %xmm0, (%rsi,%rax,8)   # FP store
    addq $8, %rdi
    cmpq %rcx, %rdi
    jne .L4
```
- Code updates `b[i]` on every iteration
- Why couldn't the compiler optimize this away?
- Vector `b` might overlap matrix `a`
- If we modify `b`, it will be reflected in a later iteration of the loop
- The compiler has to assume that any pointer could be another name (alias) for the same memory
- The compiler just can't do that kind of code motion; you have to do it.
- There are "annotations" to tell the compiler what you want in Fortran. (C offers something similar; see the sketch below.)
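A hedged aside, not from the lecture: in C99, the `restrict` qualifier is such an annotation. It promises the compiler that the pointers do not alias, which re-enables the optimization:

```c
/* restrict asserts that, within this function, a and b are the only
   ways to reach the memory they point to, i.e., they do not alias.
   If the caller breaks that promise, behavior is undefined. */
void sum_rows1r(double *restrict a, double *restrict b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];   /* compiler may now keep b[i] in a register */
    }
}
```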
Memory Aliasing
```c
/* Sum rows of n x n matrix a
   and store in vector b */
void sum_rows1(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}
```
```c
double A[9] =
    {  0,  1,   2,
       4,  8,  16,
      32, 64, 128};

double *B = A + 3;   /* B aliases the second row of A */

sum_rows1(A, B, 3);
```
- No one would write code this way, but the compiler doesn't know that!
- Code updates `b[i]` on every iteration
- Must consider the possibility that these updates will affect program behavior
Removing Aliasing
C code:
```c
/* Sum rows of n x n matrix a
   and store in vector b */
void sum_rows2(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;
        for (j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;
    }
}
```
Assembly code:
```asm
# sum_rows2 inner loop
.L10:
    addsd (%rdi), %xmm0   # FP load + add
    addq $8, %rdi
    cmpq %rax, %rdi
    jne .L10
```
- No need to store intermediate results
- Now within this loop there is only one array access instead of two.
Optimization Blocker: Memory Aliasing
- Aliasing
- Two different memory references specify single location
- Easy to have happen in C
- Since allowed to do address arithmetic
- Direct access to storage structures
- Get in habit of introducing local variables
- Accumulating within loops
- Your way of telling compiler not to check for aliasing
Exploiting Instruction-Level Parallelism
Exploiting Instruction-Level Parallelism
- Need general understanding of modern processor design
- Hardware can execute multiple instructions in parallel
- Performance limited by data dependencies
- Simple transformations can yield dramatic performance improvement
- Compilers often cannot make these transformations
- Lack of associativity and distributivity in floating-point arithmetic
Notes:
- Associativity and distributivity do not actually hold for floating-point arithmetic, so the compiler cannot apply these transformations itself; you have to optimize them yourself.
Benchmark Example: Data Type for Vectors
```c
/* data structure for vectors */
typedef struct {
    size_t len;
    data_t *data;
} vec;
```
- The type `data_t` is a type alias — you can redefine it to test different numeric types (`int`, `float`, `double`, etc.) to see how data type affects performance.
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251103230704.png)
```c
/* retrieve vector element
   and store at val */
int get_vec_element(vec *v, size_t idx, data_t *val) {
    if (idx >= v->len)
        return 0;            /* out of bounds */
    *val = v->data[idx];     /* read the value into *val */
    return 1;
}
```
- It looks simple, but notice how it adds function call overhead and bounds checks for each element access. When used in a tight loop, these extra steps can heavily affect performance.
Data Types
- Use different declarations for `data_t`:
- `int`
- `long`
- `float`
- `double`
Benchmark Computation
Compute sum or product of vector elements
```c
void combine1(vec_ptr v, data_t *dest)
{
    long int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}
```
How it works:
- This function loops over all vector elements and combines them into one scalar.
- `OP` and `IDENT` are macros that change depending on the operation:
- If you're summing → `OP` = `+` and `IDENT` = `0`
- If you're multiplying → `OP` = `*` and `IDENT` = `1` (a configuration sketch follows below)
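A minimal sketch of how this is typically configured (the exact definitions are an assumption; they are not shown in the notes):

```c
typedef long data_t;   /* switch to int, float, or double to re-benchmark */

#define OP    +        /* use *  to compute the product instead */
#define IDENT 0        /* use 1  as the identity for the product */
```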
Why do we benchmark it?
- It's a controlled, repeatable workload that stresses:
- Memory access patterns (`v->data[i]`)
- Function call overhead (`get_vec_element`)
- Arithmetic throughput (`OP`)
- We can time it for different `n` (vector lengths) and data types to measure how efficiently the CPU handles the operation per element.
Cycles Per Element (CPE)
- Convenient way to express performance of program that operates on vectors or lists
- Length = `n`
- In our case: CPE = cycles per OP
- `T = CPE*n + Overhead`
- `T` = total cycles (execution time × CPU frequency)
- `CPE` = slope of the line
- `n` = number of elements
- `Overhead` = fixed cost (function setup, loop entry/exit, etc.)
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251030135724.png)
- The function might have had 9 cycles per element, but after optimizations (like inlining or better memory access), it goes down to 6 cycles per element.
- Note: A smaller slope means better performance
Benchmark Performance
```c
void combine1(vec_ptr v, data_t *dest)
{
    long int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}
```
- Compute sum or product of vector elements
| Method | Integer |  | Double FP |  |
|---|---|---|---|---|
| Operation | Add | Mult | Add | Mult |
| Combine1 unoptimized | 22.68 | 20.02 | 19.98 | 20.18 |
| Combine1 –O1 | 10.12 | 10.12 | 10.17 | 11.14 |
Basic Optimizations
```c
void combine4(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    data_t *d = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP d[i];
    *dest = t;
}
```
- Move `vec_length` out of loop
- Avoid bounds check on each iteration
- Accumulate in temporary
| Method | Integer |  | Double FP |  |
|---|---|---|---|---|
| Operation | Add | Mult | Add | Mult |
| Combine1 –O1 | 10.12 | 10.12 | 10.17 | 11.14 |
| Combine4 | 1.27 | 3.01 | 3.01 | 5.01 |
- Eliminates sources of overhead in loop
Not on Exam 3.
Modern CPU Design (Ex4)
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251106125750.png)
- We can make these things out of Digital Logic Design.
- This is how a computer is organized
- Separate into two main things
- Instruction control
- Fetch: retrieves the next instructions to execute
- Uses IP (Instruction Pointer) to get instructions from the instruction cache
- Keeps instructions in the instruction cache
- Instructions go from the cache to Instruction Decode → figures out what the instructions are supposed to do.
- There is a branch predictor → predicts which instruction will be next → a signal goes back to the instruction control unit for confirmation.
- Receives a signal indicating whether the branch was predicted correctly, and whether the target was correct (predicting where it goes)
- The branch predictor has to predict both the branch outcome and the target (where it goes).
- Branch targets are cached; the branch predictor may just use the cached target and assume it is correct, but it may actually be incorrect.
- Execution
- Data cache: stores recently loaded data
- Functional units -> very complex
- Array of registers is accessed by Control and Execution branches.
Superscalar Processor
- Definition: A superscalar processor can issue and execute multiple instructions in one cycle. The instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically.
- To you (looking at the assembly language) it looks like they execute one at a time, but the processor figures out how to execute many at a time, enabling parallelism.
- If one in the middle of the process is a jump instruction, then the branch predictor comes in.
- We stop the instructions beyond the jump, start fetching instructions from the predicted path, and evaluate with the branch predictor
- We want a branch to behave predictably as often as possible (e.g., if the branch condition is usually true, we want this branch to typically be taken)
- Fetch block: the instructions that are fetched in one cycle → arrange it so that if there is a taken branch, it is the last instruction in that fetch block.
- Benefit: without programming effort, superscalar processor can take advantage of the instruction level parallelism that most programs have.
- The independence between instructions that most programs have
- Most modern CPUs are superscalar.
- Intel Atom was a processor that was single-issue; that is the only example given of a non-superscalar CPU
- Intel: since Pentium (1993)
Pipelined Functional Units
```c
long mult_eg(long a, long b, long c) {
    long p1 = a*b;
    long p2 = a*c;
    long p3 = p1 * p2;
    return p3;
}
```
- `a * b` and `a * c` are independent; they do not depend on each other → you can compute them in parallel.
- But `p1 * p2` depends on both of them → it must wait until both finish.
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Stage 1 | `a*b` | `a*c` |  |  | `p1*p2` |  |  |
| Stage 2 |  | `a*b` | `a*c` |  |  | `p1*p2` |  |
| Stage 3 |  |  | `a*b` | `a*c` |  |  | `p1*p2` |

- `p1*p2` cannot enter Stage 1 until cycle 5, once both `p1` (done in cycle 3) and `p2` (done in cycle 4) are available
- Top row is "Time"
The computation proceeds in a pipelined way, like an assembly line.
- Divide computation into stages
- Pass partial computations from stage to stage
- Stage i can start on new computation once values passed to i+1
- E.g., complete 3 multiplications in 7 cycles, even though each requires 3 cycles
- A multiplication takes 3 cycles, but we can begin a new multiplication each cycle, so the three multiplications largely overlap instead of requiring 3 separate cycles each; this is because of pipelining.
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251106132046.png)
Haswell CPU
- 8 Total Functional Units
Multiple instructions can execute in parallel
- 2 load, with address computation
- 1 store, with address computation
- 4 integer
- 2 FP multiply
- 1 FP add
- 1 FP divide
Some instructions take > 1 cycle, but can be pipelined
| Instruction | Latency | Cycles/Issue |
|---|---|---|
| Load / Store | 4 | 1 |
| Integer Multiply | 3 | 1 |
| Integer/Long Divide | 3-30 | 3-30 |
| Single/Double FP Multiply | 5 | 1 |
| Single/Double FP Add | 3 | 1 |
| Single/Double FP Divide | 3-15 | 3-15 |
- Note: `leaq` (which computes an address) is completely separate; it is not executed this way
- Overlapping the execution to exploit some parallelism
Notes:
- Divide is very costly; sometimes we implement it as a multiplication instead
x86-64 Compilation of Combine4
- Inner Loop (Case: Integer Multiply)
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251106132756.png)
- Latency = the time an instruction takes.
- How long does each instruction take -> then add that up
- This is not right, multiply is pipelined.
- Some instructions can go completely in parallel, and some rely on a previous instruction.
- It doesn't really take 3 cycles to do the multiply -> it is faster because of parallelism
Combine4 = Serial Computation (OP = *)
- Computation (length = 8):
- `x = ((((((((1 * d[0]) * d[1]) * d[2]) * d[3]) * d[4]) * d[5]) * d[6]) * d[7])`
- Sequential dependence
- Performance: determined by latency of OP
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251123225249.png)
```c
x0 = IDENT;
x1 = x0 * d[0];
x2 = x1 * d[1];
x3 = x2 * d[2];
...
```
- Each multiply depends on the result of the previous multiplication.
- Even if multipliers are pipelined, dependency breaks parallelism.
- Latency of a (pipelined) multiply = ~3 cycles
- Throughput = 1 multiply per cycle
- BUT because each multiply depends on the previous one, we still must respect its latency.
Loop unrolling (2x1)
- Loop unrolling does two main things:
- Reduces branches & overhead
- Good for small operations like integer addition.
- Gives compiler more scheduling space
- Instructions can be rearranged to use parallel functional units.
- Reduces branches & overhead
- But loop unrolling does NOT automatically eliminate dependency chains.
```c
void unroll2a_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    long i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x = (x OP d[i]) OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x = x OP d[i];
    }
    *dest = x;
}
```
- Perform 2x more useful work per iteration
- We can replace the code for that loop -> this is called loop unrolling
- Why would you do that? -> to avoid a branch
- The actual instructions that result from that can be moved and scheduled to exploit instruction-level parallelism
Effect of Loop Unrolling
| Method | Integer |  | Double FP |  |
|---|---|---|---|---|
| Operation | Add | Mult | Add | Mult |
| Combine4 | 1.27 | 3.01 | 3.01 | 5.01 |
| Unroll 2x1 | 1.01 | 3.01 | 3.01 | 5.01 |
| Latency Bound | 1.00 | 3.00 | 3.00 | 5.00 |
- Helps integer add
- Achieves latency bound
- Others don’t improve. Why?
- Still sequential dependency:
- `x = (x OP d[i]) OP d[i+1];`
- This still expands to one chain:
- `x = (((...((x * d[i]) * d[i+1])) ...)`
- Still dependent on `x`
- Even though two elements are processed per loop iteration:
- `(x OP d[i])` depends on `x`
- `((x OP d[i]) OP d[i+1])` depends on both
- Next iteration depends on `x` again
Loop Unrolling with Reassociation (2x1a)
```c
void unroll2aa_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    long i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {   // Compare to before:
        x = x OP (d[i] OP d[i+1]);   // x = (x OP d[i]) OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x = x OP d[i];
    }
    *dest = x;
}
```
- Can this change the result of the computation?
- Yes, for FP. Why?
What changed?
- Before:
- Step A: `t1 = x OP d[i]` (depends on `x`)
- Step B: `x = t1 OP d[i+1]` (depends on `t1`)
- After:
- Step A: `t = d[i] OP d[i+1]` (independent of `x`)
- Step B: `x = x OP t` (depends on previous `x`)
Why this breaks part of the dependency chain
- `d[i] OP d[i+1]` can start early, before `x` is ready
- The next iteration can start computing `(d[i+2] OP d[i+3])` in parallel
- The `x` update still depends on the old `x`, but now half the work is off the dependency chain.
Why FP results change
- Floating-point is not associative:
- (a + b) + c ≠ a + (b + c)
- (a * b) * c ≠ a * (b * c)
- Reordering changes rounding.
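A quick sketch demonstrating the rounding effect (values chosen for illustration):

```c
#include <stdio.h>

int main(void) {
    float a = 1e20f, b = -1e20f, c = 1.0f;
    printf("%f\n", (a + b) + c);  /* 1.000000: a and b cancel exactly */
    printf("%f\n", a + (b + c));  /* 0.000000: c vanishes in rounding */
    return 0;
}
```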
Effect of Reassociation
| Method | Integer |  | Double FP |  |
|---|---|---|---|---|
| Operation | Add | Mult | Add | Mult |
| Combine4 | 1.27 | 3.01 | 3.01 | 5.01 |
| Unroll 2x1 | 1.01 | 3.01 | 3.01 | 5.01 |
| Unroll 2x1a | 1.01 | 1.51 | 1.51 | 2.51 |
| Latency Bound | 1.00 | 3.00 | 3.00 | 5.00 |
| Throughput Bound | 0.50 | 1.00 | 1.00 | 0.50 |
- 4 functional units for int add + 2 functional units for load
- 2 functional units for FP multiply + 2 functional units for load
- Nearly 2x speedup for `Int *`, `FP +`, `FP *`
- Why is that?
- Reason: breaks the sequential dependency `x = x OP (d[i] OP d[i+1]);`
- What changed:
- Ops in the next iteration can be started early (no dependency)
- Overall Performance
- N elements, D cycles latency/op
- (N/2+1) * D cycles: CPE = D/2
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251123230115.png)
Loop Unrolling with Separate Accumulators (2x2)
```c
void unroll2a_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x0 = IDENT;
    data_t x1 = IDENT;
    long i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x0 = x0 OP d[i];
        x1 = x1 OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 = x0 OP d[i];
    }
    *dest = x0 OP x1;
}
```
- Different form of reassociation
What does this do?
- This is the key idea:
- `x0` handles even elements
- `x1` handles odd elements
- Now we have two independent accumulator chains:
- `x0 = x0 * d[0] → * d[2] → * d[4] ...`
- `x1 = x1 * d[1] → * d[3] → * d[5] ...`
Why this is powerful
- x0 and x1 do not depend on each other
- CPU can run them in different functional units
- Two multiply pipelines → we get almost double throughput.
Effect of Separate Accumulators
| Method | Integer |  | Double FP |  |
|---|---|---|---|---|
| Operation | Add | Mult | Add | Mult |
| Combine4 | 1.27 | 3.01 | 3.01 | 5.01 |
| Unroll 2x1 | 1.01 | 3.01 | 3.01 | 5.01 |
| Unroll 2x1a | 1.01 | 1.51 | 1.51 | 2.51 |
| Unroll 2x2 | 0.81 | 1.51 | 1.51 | 2.51 |
| Latency Bound | 1.00 | 3.00 | 3.00 | 5.00 |
| Throughput Bound | 0.50 | 1.00 | 1.00 | 0.50 |
- Int + makes use of two load units
- `x0 = x0 OP d[i];`
- `x1 = x1 OP d[i+1];`
- 2x speedup (over unroll 2x1) for `Int *`, `FP +`, `FP *`
- What changed:
- Two independent "streams" of operations
- Overall Performance
- N elements, D cycles latency/op
- Should be (N/2+1) * D cycles: CPE = D/2
- CPE matches prediction!
- CPE = 3 / 2 = 1.5 cycles per element
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251123233856.png)
Unrolling & Accumulating
- Idea
- Can unroll to any degree L
- Can accumulate K results in parallel
- L must be a multiple of K (a sketch follows this list)
- Limitations
- Diminishing returns
- Too many accumulators give diminishing returns; you cannot go beyond the throughput limitations of the execution units
- Large overhead for short lengths
- Short arrays → the overhead is not worth it
- Finish off the remaining iterations sequentially
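A hedged sketch of one such variant, following the pattern of the 2x2 code above: unroll by L = 4 with K = 4 separate accumulators (the function name is assumed):

```c
void unroll4x4_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length - 3;
    data_t *d = get_vec_start(v);
    data_t x0 = IDENT, x1 = IDENT, x2 = IDENT, x3 = IDENT;
    long i;
    /* Combine 4 elements at a time, in 4 independent chains */
    for (i = 0; i < limit; i += 4) {
        x0 = x0 OP d[i];
        x1 = x1 OP d[i+1];
        x2 = x2 OP d[i+2];
        x3 = x3 OP d[i+3];
    }
    /* Finish any remaining elements sequentially */
    for (; i < length; i++)
        x0 = x0 OP d[i];
    *dest = (x0 OP x1) OP (x2 OP x3);
}
```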
What About Branches?
- Challenge
- Instruction Control Unit must work well ahead of Execution Unit to generate enough operations to keep EU busy
- When encounters conditional branch, cannot reliably determine where to continue fetching
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251123234746.png)
Branch Outcomes
- When encounter conditional branch, cannot determine where to continue fetching
- Branch Taken: Transfer control to branch target
- Branch Not-Taken: Continue with next instruction in sequence
- Cannot resolve until outcome determined by branch/integer unit
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251106133424.png)
Notes:
- With conditional branches we do not know whether we have to jump; those conditions depend on other instructions that are pipelined somewhere else.
- The only thing to do then is to wait until all the instructions this branch depends on have executed.
- From 5 to 300 cycles
- This waiting makes the pipelining worthless
Branch Prediction
- Idea
- Guess which way branch will go
- Begin executing instructions at predicted position
- But don’t actually modify register or memory data
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251106133628.png)
Notes:
- When we predict a branch, we begin speculation → computing values but not actually putting them in register files or storing them to memory where other instructions can see them; we just keep those results buffered.
- Then, once we confirm the branch was taken/not taken, we commit what we computed to registers/memory.
- If the prediction is incorrect, we toss out all the values we computed speculatively, and no one ever sees them.
- This is very costly, but we maintain correct program execution.
- "Educated guess"
- The more accurate the predictor can be, the faster our processor may be.
Branch Prediction Through Loop
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251106133951.png)
- Unrolling the loop inside the processor
- With branch prediction we keep fetching over and over, so we have multiple instances of the loop body already in flight, waiting to be verified
Branch Misprediction Invalidation
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251106134123.png)
Branch Misprediction Recovery
![](/CSCE-312/Visual%20Aids/Pasted%20image%2020251106134147.png)
- Performance Cost
- Multiple clock cycles on modern processor
- Can be a major performance limiter
- Cost: all the time we spent going down the wrong branch
- Reduce the risk by making branches easier to predict (see the sketch below)
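A hedged sketch, not from the slides: one way to avoid an unpredictable branch entirely is to write a selection as a conditional expression, which compilers typically turn into a conditional move (`cmov`) instead of a jump:

```c
/* A data-dependent if/else becomes an unpredictable branch when the
   condition varies randomly. Written as a conditional expression,
   gcc/clang typically emit a cmov: no branch, no misprediction. */
long min_val(long a, long b)
{
    return (a < b) ? a : b;
}
```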
Getting High Performance
- Good compiler and flags
- Don't do anything stupid
- Watch out for hidden algorithmic inefficiencies
- Write compiler-friendly code
- Watch out for optimization blockers: procedure calls & memory references
- Look carefully at innermost loops (where most work is done)
- Care about 2D array access: C stores arrays in row-major order, so accessing down a column (a different row each time) is slower than walking along a row (see the sketch after this list)
- Tune code for machine
- Exploit instruction-level parallelism
- Avoid unpredictable branches
- Try not to do data dependent conditional jumps
- Make code cache friendly (Covered later in course)
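A minimal sketch of the row-major point (array size and function names are illustrative):

```c
#define N 1024
double a[N][N];

/* Good: row-major order -- the inner loop walks consecutive memory */
double sum_by_rows(void) {
    double s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Bad: column order -- each access jumps N*sizeof(double) bytes */
double sum_by_cols(void) {
    double s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```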
Notes:
- Is there any case where you do not want any code optimizations?
- For debugging -> optimization makes it hard to tell what is actually going on
- Without optimization, the debugger can figure out exactly which line of code caused the problem.
- There are cases where optimization actually reveals bugs in the original code
- Maybe an alignment thing
- Doesn't it change some math operations?
- It really shouldn't
- The compiler must respect an order of operations consistent with what the programmer specified.
- Note: there exist compiler bugs where the compiler actually did something wrong, but these are extremely rare
- In 15 years of programming it happened 3 times to Professor Jimenez