Program Optimization

Class: CSCE-312


Notes:

Today

Overview

Performance Realities

Notes:

Optimizing Compilers

Notes:

Limitations of Optimizing Compilers

Generally Useful Optimization

Useful Optimization

Pasted image 20251030132005.png|600

Compiler-Generated Code Motion (-O1)

Pasted image 20251030132305.png|600

Reduction in Strength

Before optimization:

for (i = 0; i < n; i++) {
  int ni = n*i;
  for (j = 0; j < n; j++)
    a[ni + j] = b[j];
}

After optimization:

int ni = 0;
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++)
    a[ni + j] = b[j];
  ni += n;
}

Share Common Subexpressions

Pasted image 20251030133110.png|600

Optimization Blockers

Optimization Blocker #1: Procedure Calls

Procedure to Convert String to Lower Case

void lower(char *s)
{
  size_t i;
  for (i = 0; i < strlen(s); i++)
    if (s[i] >= 'A' && s[i] <= 'Z')
      s[i] -= ('A' - 'a');
}

Lower Case Conversion Performance

Pasted image 20251030133952.png|450

Improving Performance

void lower(char *s)
{
  size_t i;
  size_t len = strlen(s);
  for (i = 0; i < len; i++)
    if (s[i] >= 'A' && s[i] <= 'Z')
      s[i] -= ('A' - 'a');
}

Lower Case Conversion Performance

Pasted image 20251030134125.png|450

Optimization Blocker: Procedure Calls

Notes:

The compiler treats procedure calls as black boxes: the callee may have side effects (this instrumented strlen updates a global counter), and its return value may change between calls. It therefore cannot hoist strlen(s) out of the loop; the programmer must do it, as in the improved version above.

size_t lencnt = 0;
size_t strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++; length++;
    }
    lencnt += length;
    return length;
}

Memory Matters

C code:

/* Sum rows of n x n matrix a
   and store result in vector b */
void sum_rows1(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

Assembly code:

# sum_rows1 inner loop
.L4:
        movsd   (%rsi,%rax,8), %xmm0	# FP load
        addsd   (%rdi), %xmm0		# FP add
        movsd   %xmm0, (%rsi,%rax,8)	# FP store
        addq    $8, %rdi
        cmpq    %rcx, %rdi
        jne     .L4

Memory Aliasing

/* Sum rows of n x n matrix a
   and store result in vector b */
void sum_rows1(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

double A[9] =
  { 0,   1,   2,
    4,   8,  16,
   32,  64, 128 };

double *B = A + 3;   /* B overlaps A's second row */

sum_rows1(A, B, 3);

Removing Aliasing

C code:

/* Sum rows of n x n matrix a
   and store result in vector b */
void sum_rows2(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;
        for (j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;
    }
}

Assembly code:

# sum_rows2 inner loop
.L10:
        addsd   (%rdi), %xmm0	# FP load + add
        addq    $8, %rdi
        cmpq    %rax, %rdi
        jne     .L10

Optimization Blocker: Memory Aliasing

Exploiting Instruction-Level Parallelism

Exploiting Instruction-Level Parallelism

Notes:

Benchmark Example: Data Type for Vectors

/* data structure for vectors */
typedef struct{
	size_t len;
	data_t *data;
} vec;

Pasted image 20251103230704.png|300

/* retrieve vector element
   and store at val */
int get_vec_element(vec *v, size_t idx, data_t *val) {
	if (idx >= v->len)
		return 0;           // Out of bounds
	*val = v->data[idx];    // Read the value into *val
	return 1;
}

Data Types

Benchmark Computation

Compute sum or product of vector elements

void combine1(vec_ptr v, data_t *dest)
{
    long int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
		data_t val;
		get_vec_element(v, i, &val);
		*dest = *dest OP val;
    }
}

How it works: IDENT and OP are macros (0 and + for the sum, 1 and * for the product), and data_t ranges over int, long, float, and double, so one source file generates all benchmark variants.

Why we benchmark it: the loop does a minimal amount of work per element, so its cost per element exposes machine-level bottlenecks rather than algorithmic ones.

Cycles Per Element (CPE)

Pasted image 20251030135724.png|450

Benchmark Performance

void combine1(vec_ptr v, data_t *dest)
{
    long int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
		data_t val;
		get_vec_element(v, i, &val);
		*dest = *dest OP val;
    }
}
| Method               | Integer Add | Integer Mult | Double FP Add | Double FP Mult |
| -------------------- | ----------- | ------------ | ------------- | -------------- |
| Combine1 unoptimized | 22.68       | 20.02        | 19.98         | 20.18          |
| Combine1 -O1         | 10.12       | 10.12        | 10.17         | 11.14          |

Basic Optimizations

void combine4(vec_ptr v, data_t *dest)
{
	long i;
	long length = vec_length(v);
	data_t *d = get_vec_start(v);
	data_t t = IDENT;
	for (i = 0; i < length; i++)
		t = t OP d[i];
	*dest = t;
}
| Method       | Integer Add | Integer Mult | Double FP Add | Double FP Mult |
| ------------ | ----------- | ------------ | ------------- | -------------- |
| Combine1 -O1 | 10.12       | 10.12        | 10.17         | 11.14          |
| Combine4     | 1.27        | 3.01         | 3.01          | 5.01           |

Not on Exam 3.

Modern CPU Design (Ex4)

Pasted image 20251106125750.png|500

Superscalar Processor

Pipelined Functional Units

long mult_eg(long a, long b, long c) {
    long p1 = a*b;
    long p2 = a*c;
    long p3 = p1 * p2;
    return p3;
}
| Cycle   | 1   | 2   | 3   | 4   | 5     | 6     | 7     |
| ------- | --- | --- | --- | --- | ----- | ----- | ----- |
| Stage 1 | a*b | a*c |     |     | p1*p2 |       |       |
| Stage 2 |     | a*b | a*c |     |       | p1*p2 |       |
| Stage 3 |     |     | a*b | a*c |       |       | p1*p2 |

The independent multiplies overlap in the pipeline like stations on an assembly line; p1*p2 cannot start until both of its operands have finished.

Pasted image 20251106132046.png|127

Haswell CPU

Multiple instructions can execute in parallel

Some instructions take > 1 cycle, but can be pipelined

| Instruction               | Latency | Cycles/Issue |
| ------------------------- | ------- | ------------ |
| Load / Store              | 4       | 1            |
| Integer Multiply          | 3       | 1            |
| Integer/Long Divide       | 3-30    | 3-30         |
| Single/Double FP Multiply | 5       | 1            |
| Single/Double FP Add      | 3       | 1            |
| Single/Double FP Divide   | 3-15    | 3-15         |

Notes:

x86-64 Compilation of Combine4

Pasted image 20251106132756.png|450

Combine4 = Serial Computation (OP = *)

Pasted image 20251123225249.png|200

x0 = IDENT
x1 = x0 * d[0]
x2 = x1 * d[1]
x3 = x2 * d[2]
...

Loop unrolling (2x1)

void unroll2a_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    long i;
    
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
		x = (x OP d[i]) OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
		x = x OP d[i];
    }
    *dest = x;
}

Effect of Loop Unrolling

| Method        | Integer Add | Integer Mult | Double FP Add | Double FP Mult |
| ------------- | ----------- | ------------ | ------------- | -------------- |
| Combine4      | 1.27        | 3.01         | 3.01          | 5.01           |
| Unroll 2x1    | 1.01        | 3.01         | 3.01          | 5.01           |
| Latency Bound | 1.00        | 3.00         | 3.00          | 5.00           |
x = (x OP d[i]) OP d[i+1];

This still expands to a single serial dependency chain:

x = (...((x * d[0]) * d[1]) ...) * d[n-1]

Loop Unrolling with Reassociation (2x1a)

void unroll2aa_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    long i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {    // Compare to before:
		x = x OP (d[i] OP d[i+1]);    // x = (x OP d[i]) OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
		x = x OP d[i];
    }
    *dest = x;
}

What changed? Only the parentheses: d[i] OP d[i+1] is now computed first, independently of x.

Why this breaks part of the dependency chain: only one of the two OPs per iteration involves x, so the critical path through x shrinks from N operations to N/2; the d[i] OP d[i+1] computations are independent and can overlap with it in the pipeline.

Why FP results change: floating-point add and multiply are not associative (rounding depends on grouping), so the compiler will not apply this transformation to FP code on its own; the programmer must opt in.

Effect of Reassociation

| Method           | Integer Add | Integer Mult | Double FP Add | Double FP Mult |
| ---------------- | ----------- | ------------ | ------------- | -------------- |
| Combine4         | 1.27        | 3.01         | 3.01          | 5.01           |
| Unroll 2x1       | 1.01        | 3.01         | 3.01          | 5.01           |
| Unroll 2x1a      | 1.01        | 1.51         | 1.51          | 2.51           |
| Latency Bound    | 1.00        | 3.00         | 3.00          | 5.00           |
| Throughput Bound | 0.50        | 1.00         | 1.00          | 0.50           |

Pasted image 20251123230115.png|200

Loop Unrolling with Separate Accumulators (2x2)

void unroll2x2_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x0 = IDENT;
    data_t x1 = IDENT;
    long i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
       x0 = x0 OP d[i];
       x1 = x1 OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
		x0 = x0 OP d[i];
    }
    *dest = x0 OP x1;
}

What does this do? It keeps two running results: x0 accumulates the even-indexed elements and x1 the odd-indexed ones, combined once at the end.

Why this is powerful: the two accumulators form independent dependency chains, so the pipelined functional units can work on both at once; each chain only has to complete one OP per two elements.

Effect of Separate Accumulators

| Method           | Integer Add | Integer Mult | Double FP Add | Double FP Mult |
| ---------------- | ----------- | ------------ | ------------- | -------------- |
| Combine4         | 1.27        | 3.01         | 3.01          | 5.01           |
| Unroll 2x1       | 1.01        | 3.01         | 3.01          | 5.01           |
| Unroll 2x1a      | 1.01        | 1.51         | 1.51          | 2.51           |
| Unroll 2x2       | 0.81        | 1.51         | 1.51          | 2.51           |
| Latency Bound    | 1.00        | 3.00         | 3.00          | 5.00           |
| Throughput Bound | 0.50        | 1.00         | 1.00          | 0.50           |

Pasted image 20251123233856.png|225

Unrolling & Accumulating

What About Branches?

Pasted image 20251123234746.png|500

Branch Outcomes

Pasted image 20251106133424.png|500

Notes:

Branch Prediction

Pasted image 20251106133628.png|500

Notes:

Branch Prediction Through Loop

Pasted image 20251106133951.png|600

Branch Misprediction Invalidation

Pasted image 20251106134123.png|550

Branch Misprediction Recovery

Pasted image 20251106134147.png|550

Getting High Performance

Notes: