Machine-Level Programming IV - Data

#assembly

Class: CSCE-312

Notes:

Today

Arrays
- One-dimensional
- Multi-dimensional (nested)
- Multi-level
Structures
- Allocation
- Access
- Alignment
Floating Point

Array

Array Allocation

Basic Principle
- T A[L];
  - Array of data type T and length L
  - Contiguously allocated region of L * sizeof(T) bytes in memory

Pasted image 20251007135204.png|550

Remember that these multi-byte quantities are stored little endian (the least significant byte goes first, at the lowest address).
Addresses run from x + 0 through x + 11 (address x + 12 is not included).
Remember that a pointer takes 8 bytes on x86-64.
- An array of any kind of pointer therefore has 8-byte elements.

Array Access

Basic Principle
- T A[L];
  - Array of data type T and length L
  - Identifier A can be used as a pointer to array element 0: Type T*

Reference    Type     Value
val[4]       int      3
val          int *    x
val+1        int *    x + 4
&val[2]      int *    x + 8
val[5]       int      ??
*(val+1)     int      5
val + i      int *    x + 4i

Note: val+1 is the same as &val[1]
- Adds 1 to a pointer but not to the value
Remember that a unary '*' applied to a pointer expression (outside a declaration) is the dereference operator.
val[5] is out of bounds, so it is undefined behavior: anything could happen. Ideally a runtime system would catch this so that we know when it happens, but C does not check.
- There is probably other memory allocated just past the end of the array, so a stray read returns garbage and a stray write could corrupt other data or even the control flow of the program.
- If the access lands on memory the process is not allowed to touch, the result is a segmentation fault; often, though, it fails silently.
Other languages like Java or Python catch these errors at runtime because they do array bounds checking, at the cost of some performance.
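A minimal sketch of the pointer arithmetic in the table above. The array contents are an assumption, chosen only to agree with val[4] = 3 and *(val+1) = 5, and the helper names are made up for illustration:

```c
#include <stddef.h>

/* Hypothetical data, chosen to agree with the table (val[4] = 3, *(val+1) = 5). */
static int val[5] = {1, 5, 2, 1, 3};

/* Distance in bytes from val to val + i; pointer arithmetic scales i by sizeof(int). */
ptrdiff_t byte_offset(size_t i) {
    return (char *)(val + i) - (char *)val;
}

/* val[i] is defined as *(val + i), so both expressions name the same element. */
int same_element(size_t i) {
    return &val[i] == val + i && val[i] == *(val + i);
}
```

With 4-byte ints, byte_offset(i) should come out as 4i, matching the "x + 4i" row. Only indices 0 through 4 may be dereferenced.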

Array Example

Pasted image 20251007135801.png|500

Declaration “zip_dig cmu” equivalent to “int cmu[5]”
Example arrays were allocated in successive 20 byte blocks
- Not guaranteed to happen in general

Notes:

typedef is written like a variable declaration, but the name it introduces is treated as a type.
- zip_dig is a type (an array of 5 int's)
For cmu, what would be the byte at address 16?
- It is 1: cmu starts at address 16, its first element is 1, and little endian puts the least significant byte (0x01) at the lowest address.
The elements of an array are guaranteed to be laid out consecutively in memory, though the array itself could be at any location in memory.
- It is better if data is aligned on some power-of-2 boundary
- Done for optimization
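To see the byte order for yourself, here is a small sketch that reads back the individual bytes of an int as stored in memory. The helper name byte_at is hypothetical:

```c
#include <stddef.h>
#include <string.h>

/* Returns byte k of x as stored in memory (k = 0 is the lowest address). */
unsigned char byte_at(int x, size_t k) {
    unsigned char bytes[sizeof(int)];
    memcpy(bytes, &x, sizeof x);   /* copy the object representation */
    return bytes[k];
}
```

On a little-endian machine such as x86-64, byte_at(1, 0) should be 1 and the other three bytes 0. A big-endian machine would give the opposite order.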

Array Accessing Example

Pasted image 20251007140014.png|500

A function that accepts a zip_dig and a digit position and returns the digit of zip_dig in that position.

x86-64

  # %rdi = z
  # %rsi = digit
movl (%rdi,%rsi,4), %eax  # z[digit]

Register %rdi contains starting address of array
Register %rsi contains array index
Desired digit at %rdi + 4*%rsi
- Note the similarity with the form: x + 4i
Use memory reference (%rdi,%rsi,4)
Integer return value is in %eax

Array Loop Example

C code:

void zincr(zip_dig z) {
  size_t i;
  for (i = 0; i < ZLEN; i++)
    z[i]++;
}

Why would you want to do this? No reason, it is just an example.
Increments element z[i] on each pass.

Assembly code:

# %rdi = z
	movl    $0, %eax          #   i = 0
	jmp     .L3               #   goto middle
.L4:                        # loop:
	addl    $1, (%rdi,%rax,4) #   z[i]++
	addq    $1, %rax          #   i++
.L3:                        # middle
	cmpq    $4, %rax          #   i:4 (because array size is 5)
	jbe     .L4               #   if <=, goto loop
	rep; ret

Why are we using %rax (%eax) as the counter here?
- Since this function has no return value, we can use it freely.
Note (%rdi,%rax,4) is the address of that element
- addl $1, (%rdi,%rax,4) reads the array element, adds 1 to it, and writes the result back to the array element.
cmpq $4, %rax compares %rax with 4. Why is it comparing with 4 instead of 5?
- For integers, i < 5 is the same condition as i <= 4
- ZLEN is 5, but the compiler is free to pick either form. Sometimes it uses <= rather than <; there is no negative effect.
Note we are using jbe (jump if below or equal, an unsigned comparison) because i is a size_t, which is unsigned.
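The jump-to-middle shape of the assembly can be written back in C with goto. This is only a sketch to make the control flow visible; zincr_goto is a made-up name, and nobody would write the loop this way on purpose:

```c
#include <stddef.h>

#define ZLEN 5
typedef int zip_dig[ZLEN];

/* Same behavior as zincr, structured like the compiled code. */
void zincr_goto(zip_dig z) {
    size_t i = 0;          /* movl $0, %eax */
    goto middle;           /* jmp .L3 */
loop:
    z[i]++;                /* addl $1, (%rdi,%rax,4) */
    i++;                   /* addq $1, %rax */
middle:
    if (i <= ZLEN - 1)     /* cmpq $4, %rax ; jbe .L4 */
        goto loop;
}
```

The test sits at the bottom of the loop, and the initial jump to middle makes sure it also runs once before the first iteration.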

Multidimensional (Nested) Arrays

Declaration: T A[R][C];
- 2D array of data type T
- R rows, C columns
- Type T element requires K bytes
Array Size
- R * C * K bytes
Arrangement
- Row-Major Ordering
  - Reads row by row in order (from left to right)
```
A[0][0]   ...   A[0][C-1]
.                       .
.                       .
.                       .
A[R-1][0] ... A[R-1][C-1]
```
  - In the computer everything is linear, one address after the other. We draw it two-dimensionally because it gives a better idea of what we are looking at: it looks like a matrix.

Pasted image 20251016125103.png|450

Note row-major ordering, why is this important?
- Some languages (Fortran, for example) use column-major ordering instead. It is just a different convention; neither is inherently better.
- Traversing a row-major array column by column is much slower, so it is important to know the order your system/language uses.
The processor cache makes memory access faster
- An arithmetic operation takes a couple of cycles
- A mov that has to go to main memory could take hundreds of cycles
- The cache remembers recently accessed pieces of memory (and their neighbors), so that we can get them quickly instead of waiting hundreds of cycles
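A sketch of the two traversal orders, assuming a small 4 x 5 int array. ROWS, COLS, and the function names are hypothetical:

```c
#define ROWS 4
#define COLS 5

/* Visits elements in the order they sit in memory (stride of one int). */
long sum_row_order(int a[ROWS][COLS]) {
    long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Visits the same elements column by column (stride of COLS ints). */
long sum_col_order(int a[ROWS][COLS]) {
    long sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}
```

Both functions return the same sum. For an array this small the speed difference is not measurable, but for large arrays the row-order loop tends to make much better use of the cache.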

Nested Array Example

Pasted image 20251016125257.png|450

“zip_dig pgh[4]” equivalent to “int pgh[4][5]”
- Variable pgh: array of 4 elements, allocated contiguously
- Each element is an array of 5 int’s, allocated contiguously
- Summary: 4 rows, 5 columns
“Row-Major” ordering of all elements in memory

Notes:

In C and C++ a two-dimensional array is an array of arrays
- They are kind of limited
- This is a primitive array type: you always need to specify the sizes (number of rows and columns)
- The dimensions are baked into the array's type.
In this example, each row has 5 elements.

Nested Array Row Access

Row Vectors
- A[i] is array of C elements
- Each element of type T requires K bytes
- Starting address A + i * (C * K)
  - K is the size of the object
    - Int -> 4
    - Double/pointer -> 8
    - Struct -> ?? (need to compute size)
  - i (index) will tell you which inner array to access
  - In assembly we can represent this formula with two leaq instructions.

Pasted image 20251016125512.png|450

Nested Array Row Access Code

Pasted image 20251016133401.png|400

C code:

int *get_pgh_zip(int index)
{
  return pgh[index];
}

Assembly code for A + i * (C * K):

# %rdi = index
	leaq (%rdi,%rdi,4),%rax	  # 5 * index
	leaq pgh(,%rax,4),%rax	  # pgh + (20 * index)

Row Vector
- pgh[index] is array of 5 int’s
- Starting address pgh+20*index
Machine Code
- Computes and returns address
- Compute as pgh + 4*(index+4*index)

Notes:

We multiply %rdi by 5 because each row has 5 elements; index * 5 tells us how far into the array (counted in int's) that row starts.
- First get the offset in terms of int's, then scale by 4 to get it in terms of bytes
We have computed a pointer to the row vector in this two-dimensional array.

Nested Array Element Access

Array Elements
- A[i][j] is element of type T, which requires K bytes
- Address A + i * (C * K) + j * K
  - = A + (i * C + j)* K

Pasted image 20251016134002.png|500

Nested Array Element Access Code

Pasted image 20251016133401.png|400

C code:

int get_pgh_digit
  (int index, int dig)
{
  return pgh[index][dig];
}

Assembly code for A + i * (C * K) + j * K:

# %rdi = index, %rsi = dig
leaq	(%rdi,%rdi,4), %rax	 # 5*index
addq	%rax, %rsi	         # 5*index+dig
movl	pgh(,%rsi,4), %eax	 # M[pgh + 4*(5*index+dig)]

Array Elements
- pgh[index][dig] is int
- Address: pgh + 20*index + 4*dig
  - Tells us the byte in memory where this value lives
  - = pgh + 4*(5*index + dig)

Notes:

A two-dimensional array corresponds exactly to a matrix.

Multi-Level Array Example

Element access looks the same syntactically, but the layout in memory is different and more flexible: an array of pointers, each pointing to a separate one-dimensional array.

zip_dig cmu = { 1, 5, 2, 1, 3 };
zip_dig mit = { 0, 2, 1, 3, 9 };
zip_dig ucb = { 9, 4, 7, 2, 0 };

#define UCOUNT 3
int *univ[UCOUNT] = {mit, cmu, ucb};

Variable univ denotes array of 3 elements
Each element is a pointer
- 8 bytes
Each pointer points to array of int’s
- Each one points to a different one-dimensional array

Pasted image 20251016134434.png|500

Element Access in Multi-Level Array

C code:

int get_univ_digit
  (size_t index, size_t digit)
{
  return univ[index][digit];
}

Assembly code:

salq    $2, %rsi              # 4*digit
addq    univ(,%rdi,8), %rsi   # p = univ[index] + 4*digit
movl    (%rsi), %eax          # return *p
ret

We take the digit index and multiply it by 4 to get a byte offset.
We add it to the row pointer loaded from memory; univ is indexed in steps of 8 because its elements are 8-byte pointers.
We return the value stored at that address (movl (%rsi), %eax loads it), not the address itself.

Pasted image 20251016134623.png|400

Computation
- Element access Mem[Mem[univ+8*index]+4*digit]
- Must do two memory reads
  - First get pointer to row array
  - Then access element within array (get the digit)
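The two memory reads can be spelled out in C. This is a sketch that uses the arrays from the example above; the function name get_univ_digit_explicit is made up, and it should return the same value as univ[index][digit]:

```c
#include <stddef.h>

static int cmu[5] = {1, 5, 2, 1, 3};
static int mit[5] = {0, 2, 1, 3, 9};
static int ucb[5] = {9, 4, 7, 2, 0};
static int *univ[3] = {mit, cmu, ucb};

int get_univ_digit_explicit(size_t index, size_t digit) {
    /* First read: fetch the row pointer stored at univ + 8*index. */
    int *row = *(int **)((char *)univ + sizeof(int *) * index);
    /* Second read: fetch the int stored at row + 4*digit. */
    return *(int *)((char *)row + sizeof(int) * digit);
}
```

Compare this with the nested array, where the row address is computed by arithmetic alone and only one memory read is needed.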

N X N Matrix Code

Fixed dimensions
- Know value of N at compile time
- With primitive types we need to know how many rows and columns, otherwise we cannot do the arithmetic
Variable dimensions, explicit indexing
- Traditional way to implement dynamic arrays
- Go to the indexes manually using the following formula: ((i)*(n)+(j))
Variable dimensions, implicit indexing
- Uses C99 variable-length array parameters, now supported by gcc

Pasted image 20251016135046.png|350

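We can make a two-dimensional array out of a one-dimensional array accessed through a pointer, computing the index ourselves.
Note: in that case we are simulating a two-dimensional array with a one-dimensional array.
With C99 variable-length arrays, gcc also supports passing a two-dimensional array whose dimensions are themselves function arguments.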
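A sketch of the explicit-indexing approach, using the ((i)*(n)+(j)) formula from above. The names IDX, vec_ele, and vec_set are assumptions for illustration:

```c
#include <stddef.h>

/* Explicit row-major index into a flat array with n columns. */
#define IDX(n, i, j) ((i) * (n) + (j))

/* Reads element (i, j) of an n x n matrix stored as one flat array. */
int vec_ele(size_t n, const int *a, size_t i, size_t j) {
    return a[IDX(n, i, j)];
}

/* Writes element (i, j) of the same flat layout. */
void vec_set(size_t n, int *a, size_t i, size_t j, int value) {
    a[IDX(n, i, j)] = value;
}
```

For a 3 x 3 matrix stored as int flat[9], element (1, 2) lives at flat[1*3 + 2] = flat[5].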

16 X 16 Matrix Access

Array Elements
- Address A + i * (C * K) + j * K
- C = 16, K = 4

C code:

/* Get element a[i][j] */
int fix_ele(fix_matrix a, size_t i, size_t j) {
  return a[i][j];
}

Assembly code:

# a in %rdi, i in %rsi, j in %rdx
salq    $6, %rsi             # 64*i              -> (i * (C * K))
addq    %rsi, %rdi           # a + 64*i          -> (A + i * (C * K))
movl    (%rdi,%rdx,4), %eax  # M[a + 64*i + 4*j] -> (A + i * (C * K) + j * K)
ret

n X n Matrix Access

Array Elements
- Address A + i * (C * K) + j * K
- C = n, K = 4
- Must perform integer multiplication

C code:

/* Get element a[i][j] */
int var_ele(size_t n, int a[n][n], size_t i, size_t j) {
  return a[i][j];
}

Assembly code:

  # n in %rdi, a in %rsi, i in %rdx, j in %rcx
  imulq   %rdx, %rdi           # n*i
  leaq    (%rsi,%rdi,4), %rax  # a + 4*n*i
  movl    (%rax,%rcx,4), %eax  # M[a + 4*n*i + 4*j]
  ret

Structures

Structure Representation

Pasted image 20251021125012.png|450

Structure represented as block of memory
- Big enough to hold all of the fields
Fields ordered according to declaration
- Even if another ordering could yield a more compact representation
Compiler determines overall size + positions of fields
- Machine-level program has no understanding of the structures in the source code

Note:

In C++, a struct is just a class where everything is public by default
In C, this creates a struct type with the tag rec
- You still need to write struct rec to use the type
- C++ still recognizes this syntax but does not require it
Remember size_t is the same as unsigned long on x86-64 Linux
a is an array of 4 int's (4 bytes each, 16 bytes total)
- Ends at position 15
i starts at position 16 and is 8 bytes because it is an unsigned long
next is a pointer so it takes 8 bytes as well.
Some fields are accessed more frequently than others, so another optimization is to group the frequently accessed fields together, apart from the infrequently accessed ones.
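The field positions can be checked with offsetof. This sketch assumes the struct from the figure is declared as below (a, then a size_t i, then next) and that we are on x86-64 Linux; the helper function names are made up:

```c
#include <stddef.h>

/* Layout as described in the notes: a at 0, i at 16, next at 24. */
struct rec {
    int a[4];
    size_t i;
    struct rec *next;
};

/* Offsets are compile-time constants; offsetof exposes what the compiler chose. */
size_t rec_offset_a(void)    { return offsetof(struct rec, a); }
size_t rec_offset_i(void)    { return offsetof(struct rec, i); }
size_t rec_offset_next(void) { return offsetof(struct rec, next); }
```

Under those assumptions the offsets should be 0, 16, and 24, and sizeof(struct rec) should be 32.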

Generating Pointer to Structure Member

Pasted image 20251021125914.png|450

Generating Pointer to Array Element
- Offset of each structure member determined at compile time
- Compute as r + 4*idx

C code:

int *get_ap
 (struct rec *r, size_t idx)
{
  return &r->a[idx];
}

Assembly code:

  # r in %rdi, idx in %rsi  
  leaq  (%rdi,%rsi,4), %rax
  ret

a is the first field of the struct (offset 0), so we do not need to add any offset to r
%rdi is the pointer to the struct
%rsi is the index
Compute %rdi + 4*%rsi and put the resulting address into %rax

Following Linked List

C code:

struct rec {
    int a[4];
    int i;
    struct rec *next;
};

void set_val
  (struct rec *r, int val)
{
  while (r) {
    int i = r->i;
    r->a[i] = val;
    r = r->next;
  }
}

Walks down the linked list; in each node it takes i as an index into the a array and sets that element to the value passed as a parameter.
Note that r = r->next makes the next node the current one, so the loop continues over the remaining elements of the linked list.

Assembly code:

.L11:                         # loop:
  movslq  16(%rdi), %rax      #   i = M[r+16]	  
  movl    %esi, (%rdi,%rax,4) #   M[r+4*i] = val
  movq    24(%rdi), %rdi      #   r = M[r+24]
  testq   %rdi, %rdi          #   Test r
  jne     .L11                #   if !=0 goto loop

movslq 16(%rdi), %rax: takes the 4-byte int i and sign-extends it to 8 bytes.
movl %esi, (%rdi,%rax,4): takes val and stores it into a[i].
Note r->next is read from r + 24, since this is where the next pointer is found in the struct.
testq: if the pointer is not null then we loop, otherwise we fall through.
%rdi = pointer to the current node (initially the beginning of the list)

But this assembly code is not quite right!

As shown, it differs from the C code in a way that could cause a failure/bug
Why is this C while loop not the same as this assembly code?
- It does not test at the beginning!
- What if r is initially null? Then reading 16(%rdi) would dereference a null pointer and cause a segmentation fault.
Study this example! It shows everything we need to know about accessing structs and arrays.
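One way to see the missing piece is to write the loop in C with the same bottom-tested shape as the assembly and add the initial test back in. This is a sketch, set_val_goto is a made-up name, and the assembly in the comments is the listing from above:

```c
#include <stddef.h>

struct rec {
    int a[4];
    int i;
    struct rec *next;
};

void set_val_goto(struct rec *r, int val) {
    if (!r)                 /* initial test that the loop-only listing lacks */
        goto done;
loop:
    r->a[r->i] = val;       /* movslq 16(%rdi),%rax ; movl %esi,(%rdi,%rax,4) */
    r = r->next;            /* movq 24(%rdi),%rdi */
    if (r)                  /* testq %rdi,%rdi ; jne .L11 */
        goto loop;
done:
    return;
}
```

With the guard in place, calling set_val_goto(NULL, v) returns immediately instead of dereferencing a null pointer.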

Structures & Alignment

Unaligned Data
- Not knowing anything about alignment
  - 1 byte, 4 bytes, 4 bytes, and 8 bytes (17 bytes in total)
Aligned Data
- Primitive data type requires K bytes
- Address must be multiple of K
- Note we needed to add 3 padding bytes so that the int array starts at an address that is a multiple of 4
- The double should go at an address that is a multiple of 8, which takes another 4 padding bytes.
- Ends up costing us 24 bytes because of the alignment rules and those padding bytes.
  - Nothing is stored in those bytes, but they are not really wasted because they let your program perform better
  - Example: on some machines, accessing an int at an address that is not a multiple of 4 causes an exception

Alignment Principles

Aligned Data
- Primitive data type requires K bytes
- Address must be multiple of K
- Required on some machines; advised on x86-64
Motivation for Aligning Data
- Memory accessed by (aligned) chunks of 4 or 8 bytes (system dependent)
  - Inefficient to load or store datum that spans quad word boundaries
  - Virtual memory trickier when datum spans 2 pages
    - When a piece of data spans two virtual pages
    - With proper alignment this never happens
Compiler
- Inserts gaps in structure to ensure correct alignment of fields

Specific Cases of Alignment (x86-64)

1 byte: char, …
- no restrictions on address (always aligned)
2 bytes: short, …
- lowest 1 bit of address must be 0₂
- In hexadecimal, the last digit must be even (0, 2, 4, 6, 8, A, C, E).
4 bytes: int, float, …
- lowest 2 bits of address must be 00₂
- In hexadecimal, the last digit must be 0, 4, 8, or C.
8 bytes: double, long, char *, …
- lowest 3 bits of address must be 000₂
- In hexadecimal, the last digit must be 0 or 8.
16 bytes: long double (GCC on Linux)
- lowest 4 bits of address must be 0000₂
- In hexadecimal, the last digit must be 0.

Example question:

Question type: "Here is a hexadecimal address ...; is it properly aligned for its type?"
Example question: Consider the hexadecimal address 0x12345678 and a requirement for 4-byte (int) alignment.
- Alignment Requirement: 4-byte alignment means the address must be a multiple of 4.
- Hexadecimal Address: 0x12345678
- Modulo Check:
  - Every higher hex digit contributes a multiple of 16 (and therefore of 4), so only the last digit matters.
  - The last hexadecimal digit is 8.
  - Since 8 is divisible by 4 (8 mod 4 = 0), the address 0x12345678 is properly aligned for a 4-byte type.
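The same check can be done with a bit mask instead of a modulo. This is a sketch; is_aligned is a hypothetical helper, and it assumes k is a power of two:

```c
#include <stdint.h>
#include <stddef.h>

/* Returns 1 if addr is a multiple of k; k is assumed to be a power of two. */
int is_aligned(uintptr_t addr, size_t k) {
    return (addr & (uintptr_t)(k - 1)) == 0;   /* low log2(k) bits must be zero */
}
```

For 0x12345678 this reports aligned for k = 4 and k = 8 (the low three bits of 8 are 000) but not for k = 16, which agrees with the "last hex digit" rules listed above.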

Satisfying Alignment with Structures

"We have to satisfy alignment within structures and between structures"

Within structure:
- Must satisfy each element’s alignment requirement
Overall structure placement
- Each structure has alignment requirement K
  - K = Largest alignment requirement of any element within the structure
- Initial address & structure length must be multiples of K
Example:
- K = 8, due to double element (the element with the largest alignment requirement in the struct)

Meeting Overall Alignment Requirement

For largest alignment requirement K
Overall structure must be multiple of K
- Now we have 7 padding bytes at the end, which ensure that the next struct in an array starts at a correctly aligned address

Arrays of Structures

Overall structure length multiple of K
Satisfy alignment requirement for every element

Pasted image 20251021132714.png|500 Pasted image 20251021132724.png|150

Accessing Array Elements

Compute array offset 12*idx
- sizeof(S3), including alignment spacers
Element j is at offset 8 within structure
Assembler gives offset a+8
- Resolved during linking

Pasted image 20251021132820.png|500 Pasted image 20251021132831.png|150

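%rdi is the index
The leaq instruction just multiplies %rdi by 3
movzwl: move word to long with zero extension. The scale factor of 4 in its address completes the multiplication: 4 * (3 * idx) = 12 * idx.
- It moves the 16-bit field into the 32-bit register that we return.
- A plain movw into %ax would also work
  - If the function returns a short, the caller only uses the low 16 bits of %eax, so the upper bits do not matter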
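A sketch of the same address computation in C. The declaration of struct S3 is an assumption reconstructed from the stated size (12) and the offset of j (8), and get_j_manual is a made-up name:

```c
#include <stddef.h>
#include <string.h>

/* Assumed layout: i at 0, 2 pad bytes, v at 4, j at 8, 2 pad bytes (12 total). */
struct S3 {
    short i;
    float v;
    short j;
};

static struct S3 a[10];

short get_j_manual(size_t idx) {
    const char *base = (const char *)a;
    short j;
    /* a + sizeof(struct S3)*idx + offsetof(struct S3, j), i.e. a + 12*idx + 8 here */
    memcpy(&j, base + sizeof(struct S3) * idx + offsetof(struct S3, j), sizeof j);
    return j;
}
```

The result should match a[idx].j, which is what the compiler-generated a+8(,%rax,4) access reads.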

Saving Space

Put large data types first
Effect (K=4)
- We went from wasting 6 bytes to wasting 2 bytes!
- Why would you use the layout on the left when the one on the right is more compact?
  - Maybe readability
  - Maybe to put the most frequently accessed field first, but that is probably less important
- Other languages like Java may move fields and objects around to make them more efficient, but C is nice in the sense that it gives you control over this.
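A sketch of the effect with two hypothetical structs that hold the same three fields in different orders. The struct and function names are made up, and the byte counts assume x86-64:

```c
#include <stddef.h>

/* Hypothetical structs: the same three fields in two different orders. */
struct small_first { char c; int i; char d; };  /* c, 3 pad, i, d, 3 pad */
struct large_first { int i; char c; char d; };  /* i, c, d, 2 pad */

/* Padding = total size minus the bytes the fields themselves need. */
size_t wasted_small_first(void) {
    return sizeof(struct small_first) - (sizeof(char) + sizeof(int) + sizeof(char));
}

size_t wasted_large_first(void) {
    return sizeof(struct large_first) - (sizeof(int) + sizeof(char) + sizeof(char));
}
```

Under those assumptions the first layout takes 12 bytes (6 of them padding) and the second takes 8 bytes (2 of them padding).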

Floating Point

Floating point refers to how computers represent and process real numbers (numbers with decimals) — like 3.14159 or -0.0001 — rather than integers.

Because floating-point math is more complex (it involves exponents, rounding, precision, etc.), CPUs historically used special hardware and instructions to handle it efficiently.

Background

History
- x87 FP (Legacy Floating Point Unit)
  - The x87 refers to the original Floating Point Unit (FPU) used in early Intel architectures (8087, 80387, etc.).
  - It used a stack-based model for computation (ST(0), ST(1), ...).
  - It was cumbersome and awkward for compilers and assembly programmers to use — hence “very ugly.”
  - Still supported for backward compatibility, but not used in modern code.
- SSE FP (Streaming SIMD Extensions)
  - Introduced around Pentium III.
  - Uses registers XMM0–XMM15 (128-bit).
  - Instead of a stack, it uses flat registers, which are much easier to program with.
  - It supports SIMD: Single Instruction, Multiple Data — doing the same operation on multiple numbers at once.
  - It was a game-changer for multimedia, graphics, and scientific applications.
  - “Special case use of vector instructions” — because SSE was originally designed for graphics/multimedia (vector-like data).
- AVX FP (Advanced Vector Extensions)
  - Introduced after SSE — the newest and most powerful FP system in CPUs.
  - Expands registers to 256 bits (YMM0–YMM15), and later 512 bits (ZMM) with AVX-512.
  - Basically an evolution of SSE, using the same idea but wider registers and better instruction formats.
  - Commonly used today for scientific computing, ML, data analysis, video encoding, etc.

Fun facts:

A GPU is an accelerator
- GPUs have many small cores optimized for parallel floating-point operations — they are specialized “floating point accelerators.”
We used to make FPUs (Floating point accelerators)
- Before CPUs had integrated FPUs, floating-point math was handled by a separate chip (e.g., Intel 8087 coprocessor).
There was this need for vector instructions
- Instructions that perform the same operation in parallel on data at adjacent addresses; this is faster for things like multimedia and graphics.
- When you want to process large arrays (like pixels, audio samples, or matrices), doing one number at a time is too slow.
- The idea was then to use these vector registers and instructions for floating-point arithmetic as well.
AVX stands for "Advanced Vector Extensions"; a vector here is an array of floats
- The term Advanced Vector Extensions literally means “we extended the CPU to handle vectors of floats.”
- Take a vector of 8 floats and another vector of 8 floats, and add them with a single instruction.

Programming with SSE3

XMM Registers

Specific registers for working with floating point and parallel execution
16 total, each 16 bytes (xmm0, ..., xmm15)
Each register can hold any one of:
- 16 single-byte integers
- 8 16-bit integers
- 4 32-bit integers
- 4 single-precision floats (float)
- 2 double-precision floats (double)
- 1 single-precision float
- 1 double-precision float

Scalar & SIMD Operations

Scalar Operations: Single Precision
- In addss, the first s means "scalar" and the second s means "single precision"
- addss %xmm0, %xmm1 adds the single-precision float in %xmm0 to the single-precision float in %xmm1 and stores the result in %xmm1
SIMD Operations: Single Precision
- SIMD = Single Instruction, Multiple Data
- Can operate on multiple items of data with one single instruction
- For example you can have a single instruction that adds two vectors all at once
- This is fast: it avoids fetching and decoding a separate instruction for each element
  - Really good thing for performance
- The next level is MIMD = Multiple Instruction, Multiple Data, as in multi-core or multi-threaded execution
- addps means "add packed single precision"
- The hardware may perform the four additions sequentially or in parallel, the latter being faster.
Scalar Operations: Double Precision
- addsd takes the low 8 bytes of each register and treats them as a double.
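From C, the packed and scalar single-precision adds can be reached through the SSE intrinsics in <xmmintrin.h>. This is a sketch for x86 / x86-64 only; the function names add4_packed and add1_scalar are made up, and which instructions the compiler actually emits is up to the compiler:

```c
#include <xmmintrin.h>  /* SSE intrinsics; x86 / x86-64 only */

/* Packed add: out[k] = x[k] + y[k] for k = 0..3, typically a single addps. */
void add4_packed(const float *x, const float *y, float *out) {
    __m128 vx = _mm_loadu_ps(x);              /* load 4 floats (no alignment required) */
    __m128 vy = _mm_loadu_ps(y);
    _mm_storeu_ps(out, _mm_add_ps(vx, vy));   /* 4 additions in one operation */
}

/* Scalar add: only the low lane is added, typically a single addss. */
float add1_scalar(float x, float y) {
    return _mm_cvtss_f32(_mm_add_ss(_mm_set_ss(x), _mm_set_ss(y)));
}
```

add4_packed handles four floats per call, while add1_scalar handles one; the register is the same 16-byte XMM register in both cases.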

Note: the actual representation of floating point numbers is much more complex than what we saw with int's; we may look at it later.

FP Basics

Arguments passed in %xmm0, %xmm1, ...
Result returned in %xmm0.
All XMM registers caller-saved

Pasted image 20251021135417.png|550

%xmm0 is like %rax but for floats.
For double it is exactly the same, but the instruction names end in d for "double precision" (for example addsd instead of addss).

FP Memory Referencing

Integer (and pointer) arguments passed in regular registers
FP values passed in XMM registers
Different mov instructions to move between XMM registers, and between memory and XMM registers

Pasted image 20251021135720.png|450

movapd copies xmm0 (v) into xmm1; this copy will eventually hold x + v
The first movsd ("move scalar double precision") copies what rdi points to into xmm0
addsd takes xmm0 and adds it to xmm1, storing the result in xmm1.
The second movsd stores the result of the addition to the address pointed to by rdi
Basically much the same as what is done with integer data.
Remember you should have structures aligned.
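For reference, a C function that matches the instruction sequence described above might look like the following. This is a reconstruction from the description, not necessarily the exact source on the slide, and the name dincr is an assumption:

```c
/* Returns the old value of *p and stores old + v back through p.
 * p arrives in %rdi (pointer argument), v in %xmm0 (FP argument). */
double dincr(double *p, double v) {
    double x = *p;   /* movsd (%rdi), %xmm0, after v was copied to %xmm1 */
    *p = x + v;      /* addsd %xmm0, %xmm1 ; movsd %xmm1, (%rdi) */
    return x;        /* result is left in %xmm0 */
}
```

Note how the pointer travels in an integer register while both double values stay in XMM registers.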

Other Aspects of FP Code

Lots of instructions
- Different operations, different formats, ...
- There are many other very specific floating point instructions
Floating-point comparisons
- Instructions ucomiss and ucomisd
- Set condition codes CF, ZF, and PF
Using constant values
- Set XMM0 register to 0 with instruction xorpd %xmm0, %xmm0
- This is the only constant you can get easily; other constants have to be loaded from memory.
Floating-point representation
- IEEE has set up standards for representing floating-point numbers in computing
- Consider the sign, the mantissa (significand), the exponent, etc.

Summary

Arrays
- Elements packed into contiguous region of memory
- Use index arithmetic to locate individual elements
Structures
- Elements packed into single region of memory
- Access using offsets determined by compiler
- May require internal and external padding to ensure alignment
Combinations
- Can nest structure and array code arbitrarily
  - Can have an array of structs, and those structs can contain arrays, and so on.
Floating Point
- Data held and operated on in XMM registers
- We use special instructions for floating point data

...

There are some exercises on the last few slides of Machine-Level Programming IV - Data