Machine-Level Programming I - Basics

#assembly

Class: CSCE-312

Notes:

Today: Machine Programming I: Basics

History of Intel processors and architectures
C, assembly, machine code
Assembly Basics: Registers, operands, move
Arithmetic & logical operations

History of Intel processors and architectures

Intel x86 Processors

Dominate laptop/desktop/server market
Evolutionary design
- Backwards compatible all the way back to the 8086, introduced in 1978
- Added more features as time goes on
Complex instruction set computer (CISC)
- Many different instructions with many different formats
  - But, only small subset encountered with Linux programs
  - Many different instructions to do the same thing
  - Or same semantics but do different things
- Hard to match performance of Reduced Instruction Set Computers (RISC)
  - Fun fact: RISC-V is an emerging open architecture being adopted by many startups
  - RISC is a good idea for other reasons too: it is much easier to compile for a simple instruction set than for a complex one
- But, Intel has done just that!
  - In terms of speed. Less so for low power.
  - Intel matched performance of RISC technology
    - They did beautiful work
    - They run complex instructions very, very fast
    - In terms of speed Intel does better; in terms of power consumption, much worse
  - Intel offers backward compatibility
    - They own their instruction set

Intel x86 Evolution: Milestones

Name Date Transistors MHz

8086 1978 29K 5-10
- First 16-bit Intel processor. Basis for IBM PC & DOS
  - 16 bits can represent about 65 thousand distinct values (64 KB)
  - Addressing more than 64 KB was the goal
  - Pointers were two-dimensional (a segment plus an offset), which made programming hard
- 1MB address space
- Just figuring out how to make microprocessors (evolved from video game systems)
- The 80286 was very popular: it allowed isolation between processes and let you use disk as backup memory, cool stuff.
386 1985 275K 16-33
- First 32-bit Intel processor, referred to as IA32
  - 10 times as many transistors
  - Gives us nice 4GB address space
- Introduced a lot of new instructions to the instruction set
- Added “flat addressing”, capable of running Unix
  - Pointers became a single number that can point to any address
  - It was really hard to program before
- It made it possible for Linus Torvalds to invent Linux
  - 1991~. Started writing a terminal emulator and ended up writing an operating system.
  - Professor actually emailed Linus Torvalds about a bug, and he responded!
- Up to 3MB address space (professor experience)
- The 486 added a small on-chip cache and a deeper pipeline, which let it run a little faster
Pentium 4E 2004 125M 2800-3800
- First 64-bit Intel x86 processor, referred to as x86-64
- They wanted a name that they can trademark.
- Pentium 4 was an ambitious project, with a very deep pipeline
  - Pipeline: each stage does something related to executing instructions
    - ...
    - Decode
    - Execute
    - Memory
    - Write back
    - ...
  - Gives you parallelism
Core 2 2006 291M 1060-3500
- First multi-core Intel processor
- Less ambitious pipeline
- Increasing the number of transistors exponentially
Core i7 2008 731M 1700-3900
- Four cores (our shark machines)
- Reaching almost 4 GHz

Intel x86 Processors, cont.

Machine Evolution
- 386 1985 0.3M
- Pentium 1993 3.1M
- Pentium/MMX 1997 4.5M
- PentiumPro 1995 6.5M
- Pentium III 1999 8.2M
- Pentium 4 2001 42M
- Core 2 Duo 2006 291M
- Core i7 2008 731M
Added Features
- Instructions to support multimedia operations
  - Do vector operations on not just integers but also floats
- Instructions to enable more efficient conditional operations
- Transition from 32 bits to 64 bits
  - So that we can have a more comfortable reasonable address space
- More cores

Pasted image 20250916125234.png|275

Core i7 photo
L1 cache: within each core there is an L1 cache for data and another one for instructions
L2 cache is a little slower but still very fast
L3 cache: if you do not find something in the L2 cache, you still have the L3 cache to check
- Shared between cores, so it allows the cores to communicate with each other
Parallel processing occurs this way

2015 State of the Art

Core i7 Broadwell 2015
Desktop Model
- 4 cores
- Integrated graphics
- 3.3-3.8 GHz
- 65W
Server Model
- 8 cores
- Integrated I/O
- 2-2.6 GHz
- 45W

x86 Clones: Advanced Micro Devices (AMD)

Historically
- AMD has followed just behind Intel
- A little bit slower, a lot cheaper
- Pushes Intel to innovate by being a very strong competitor
  - Sometimes their products are better and cheaper, sometimes not reliable, but they get more and more reliable
- Itanium = super powerful computer back in those days
  - Incompatible with x86
- AMD extended the old Intel x86 instruction set to handle 64 bits
Then
- Recruited top circuit designers from Digital Equipment Corp. and other downward trending companies
- Built Opteron: tough competitor to Pentium 4
- Developed x86-64, their own extension to 64 bits
  - You can run your 32-bit programs without changing anything and, whenever you are ready, switch to 64-bit programs.
  - This was designed by AMD and works on both Intel and AMD processors
Recent Years
- Intel got its act together
  - Leads the world in semiconductor technology
- AMD has fallen behind
  - Relies on external semiconductor manufacturer

Intel's 64-Bit History

...

C, assembly, machine code

Definitions

Architecture: (also ISA: instruction set architecture) The parts of a processor design that one needs to understand in order to write assembly/machine code.
- It is the interface: the instructions themselves, machine language encodings, addressing, which registers are available, and how memory is accessed are all part of the ISA.
- It is not updated very often, because every change would have to be supported by many different implementations
- Examples: instruction set specification, registers.
Microarchitecture: Implementation of the architecture.
- This is the circuits, transistors, the algorithms to implement this architecture.
- Examples: cache sizes and core frequency.
  - How big the cache should be, what to keep in it, and what to throw away.
  - How fast we can clock the processor depends on the microarchitecture.
  - How to make the processor more efficient
Code Forms:
- Machine Code: The byte-level programs that a processor executes
  - Very hard to read and understand
- Assembly Code: A text representation of machine code
  - Text level representation of machine language
  - More or less English-looking instructions
Example ISAs:
- Intel: x86, IA32, Itanium, x86-64 (AMD)
- ARM: Used in almost all mobile phones
  - There are also RISC-V, PowerPC, IBM System/360, Motorola 68000 (early Apple computers), etc.

Assembly/Machine Code View

Machine_Code_View.png|400

Programmer-Visible State

PC: Program counter
- Address of next instruction
  - Contains the address of the next instruction in memory that should be executed
    - Usually: "get the next few bytes"
    - Assembles an instruction
- Called “RIP” (x86-64)
Register file
- Heavily used program data
- Old usage of the word file: just means "an array of things"
- Contains the registers rax, rdi, rsp, etc.
Condition codes
- Store status information about most recent arithmetic or logical operation
- Used for conditional branching
- Like registers but they are 1 bit, they just store conditions
- Enable control flow in our program
Memory
- Byte addressable array
  - Memory returns the bytes (instructions or data) stored at a requested address
- Code and user data
- Stack to support procedures

Turning C into Object Code

Code in files p1.c p2.c
Compile with command: gcc -Og p1.c p2.c -o p
- Use basic optimizations (-Og) [New to recent versions of GCC]
- Put resulting binary in file p
Object code
- Machine language stored in a file to be used

From_C_to_Object_Code.png|500

Take programs
Compile them into .s files (assembly files)
Assembler
- Command line tool that will assemble programs into object code.
- Converts .s files into .o files (object files)
- Not ready to be executed yet
Link them together into the executable program p
- Takes all the object files and links them together, filling in all the missing information
- Libraries are included here
- Will link the .o files
Executable program
- We call p a binary
- Can be run straight from the command line

Compiling into Assembly

C Code (sum.c)

long plus(long x, long y); 

void sumstore(long x, long y, 
              long *dest)
{
    long t = plus(x, y);
    *dest = t;
}

Needlessly complicated but doing it to illustrate how things work

Obtain with command

gcc -O -S sum.c

Compiles the C code only as far as assembly and stops, leaving the assembly file
Think of a pipe in Linux: normally the compiler passes this output straight to the next phase of the compilation process; -S lets us look at it instead.
Produces file sum.s

Generated x86-64 Assembly

sumstore:
   pushq   %rbx
   movq    %rdx, %rbx
   call    plus
   movq    %rax, (%rbx)
   popq    %rbx
   ret

rbx is a callee-saved register
- Whoever calls this function expects rbx to hold the same value after the call as it did before
pushq preserves the old value of rbx on the stack
The first movq saves the value of rdx (dest) in rbx, so it survives the call
call invokes plus, which adds the two parameters and returns the result
- plus places that sum in the rax register
The second movq takes the value returned by plus and stores it in the memory location that rbx points to
ret just means return

**Warning**: Will get very different results on other machines due to different versions of gcc and different compiler settings.

Assembly Characteristics: Data Types

“Integer” data of 1, 2, 4, or 8 bytes
- 4 main sizes of "integer" data
- Data values
  - b means byte (1 byte)
  - w means word (2 bytes)
  - l means long word (4 bytes)
  - q means quad word (8 bytes)
- Addresses (untyped pointers)
- x86 helps you here, while the RISC-V instruction set usually doesn't
Floating point data of 4, 8, or 10 bytes
Code: Byte sequences encoding series of instructions
- Some instructions are shorter, others are longer
- In RISC-V, all (base) instructions are the same size
- x86 has variable-length instructions, so it does not waste as much space
No aggregate types such as arrays or structures
- Just contiguously allocated bytes in memory
- There are no structs, classes, etc.
- You have to write everything yourself or with help of the compiler.
- There is no way to name fields like you do in C/C++
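
To make "no aggregate types" concrete, here is a minimal C sketch showing that a struct field is really just a byte offset from a base address. The `struct point` type and the helper `read_y_by_offset` are hypothetical names made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical struct, introduced only for this illustration. */
struct point {
    long x;   /* starts at byte offset 0 */
    long y;   /* starts right after x */
};

/* Read field y the way machine code does: base address plus a constant
   byte offset. There is no "field name" at this level, only the offset. */
long read_y_by_offset(const struct point *p) {
    long value;
    memcpy(&value, (const char *)p + offsetof(struct point, y), sizeof value);
    return value;
}
```

For a `struct point` holding `{ 10, 20 }`, `read_y_by_offset` returns 20, the same value as `p->y`; the compiler does this offset bookkeeping for you.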

Assembly Characteristics: Operations

Perform arithmetic function on register or memory data
- With x86 you can access memory and do arithmetic in a single instruction
  - Makes it more convenient and easier to program
Transfer data between memory and register
- Load data from memory into register
- Store register data into memory
- You can't say "take this memory and put it into that memory" in one instruction; the data has to pass through a register
Transfer control
- Unconditional jumps to/from procedures
  - Jump to another part of the program or function
- Conditional branches
- The words "jump" and "branch" in computer architecture mean exactly the same thing
  - Synonyms that we switch back and forth between
  - We do a conditional "jump" only if some condition is true.
    - Think of if statements.
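
One way to picture transfer of control is to rewrite an `if` using `goto`, which is close to what conditional and unconditional jumps do. This is only a sketch; the function names `absdiff` and `absdiff_goto` are hypothetical and not from the lecture.

```c
/* Ordinary C version (hypothetical example function). */
long absdiff(long x, long y) {
    if (x > y)
        return x - y;
    return y - x;
}

/* Same logic written with goto, closer to how the machine does it:
   test a condition, then either fall through or jump to a label. */
long absdiff_goto(long x, long y) {
    long result;
    if (x <= y)
        goto else_branch;      /* conditional branch */
    result = x - y;
    goto done;                 /* unconditional jump */
else_branch:
    result = y - x;
done:
    return result;
}
```

Both versions return 4 for the inputs (7, 3) and (3, 7).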

Object Code

Code for sumstore

0x0400595: 
   0x53
   0x48
   0x89
   0xd3
   0xe8
   0xf2
   0xff
   0xff
   0xff
   0x48
   0x89
   0x03
   0x5b
   0xc3

Total of 14 bytes
Each instruction 1, 3, or 5 bytes
Starts at address 0x0400595
Assembler
- Translates .s into .o
- Binary encoding of each instruction
- Nearly-complete image of executable code
- Missing linkages between code in different files
  - It gives object code that is mostly complete but still missing some pieces; the linker fills in these gaps.
Linker
- Resolves references between files
- Combines with static run-time libraries
  - E.g., code for malloc, printf
- Some libraries are dynamically linked
  - Linking occurs when program begins execution

Machine Instruction Example

C Code

*dest = t;

Store value t where designated by dest

Assembly

movq %rax, (%rbx)

Move 8-byte value to memory
- Quad words in x86-64 parlance
Operands:
- t: Register %rax
- dest: Register %rbx
- *dest: Memory M[%rbx]
Why the parentheses on (%rbx)?
- They mean "indirection"
- Not rbx itself but the memory that rbx points to
  - The size of the data pointed to comes from the q in the movq instruction

Object Code

0x40059e:  48 89 03

3-byte instruction
Stored at address 0x40059e
If we looked at a similar instruction, its encoding would differ in just a few bits.
Some of these bits mean movq, some of them mean rax or rbx, and some of them mean move with indirection (the parentheses), etc.
- All of that is encoded in that 3-byte sequence
  - In ARM or RISC-V it would take 4 bytes
  - In x86, instructions range from 1 byte to very long, but the common ones are usually shorter than 4 bytes (saves space)
    - We can fit more programs into the cache
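
To see how "some bits mean rax, some mean rbx, some mean indirection", here is a simplified sketch that splits the last byte of `48 89 03` into its fields. It assumes the standard x86-64 layout where that byte (called ModRM) holds a 2-bit mode, a 3-bit register number, and a 3-bit register-or-memory number; the helper names are hypothetical.

```c
/* Split a ModRM byte into its three fields: mod (2 bits), reg (3), rm (3). */
unsigned modrm_mod(unsigned char b) { return (b >> 6) & 0x3; }
unsigned modrm_reg(unsigned char b) { return (b >> 3) & 0x7; }
unsigned modrm_rm(unsigned char b)  { return b & 0x7; }

/* Register numbers 0..7 in the x86-64 encoding. */
const char *reg_name(unsigned n) {
    static const char *names[8] = { "rax", "rcx", "rdx", "rbx",
                                    "rsp", "rbp", "rsi", "rdi" };
    return names[n & 0x7];
}

/* For 48 89 03 (movq %rax,(%rbx)):
   0x48 is a prefix selecting 64-bit operands,
   0x89 is the opcode for "mov register to register-or-memory",
   0x03 has mod = 0 (memory at the address in rm), reg = rax, rm = rbx.
   For 48 89 d3 (movq %rdx,%rbx) only the last byte changes:
   mod = 3 (plain register), reg = rdx, rm = rbx. */
```

Comparing 0x03 with 0xd3 shows the point from the notes: two similar instructions differ in just a few bits.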

Disassembling Object Code

Disassembled

0000000000400595 <sumstore>:
  400595:  53               push   %rbx
  400596:  48 89 d3         mov    %rdx,%rbx
  400599:  e8 f2 ff ff ff   callq  400590 <plus>
  40059e:  48 89 03         mov    %rax,(%rbx)
  4005a1:  5b               pop    %rbx
  4005a2:  c3               retq

You can figure out from the context whether an instruction needs a q suffix, so the disassembler leaves it off mov, push, and pop
In retq somebody made the opposite decision.
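
The call line is a nice worked example of what the disassembler does: the bytes `e8 f2 ff ff ff` hold a signed 32-bit displacement (little-endian) that is added to the address of the next instruction. A minimal sketch, with hypothetical helper names:

```c
#include <stdint.h>

/* Assemble a little-endian signed 32-bit value from 4 bytes. */
int32_t read_le32(const unsigned char *p) {
    uint32_t u = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                 ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    /* Interpret the top bit as the sign (two's complement). */
    return (u & 0x80000000u) ? (int32_t)((int64_t)u - 4294967296LL)
                             : (int32_t)u;
}

/* Target of a 5-byte "e8 rel32" call located at insn_addr:
   the displacement is relative to the address of the NEXT instruction. */
uint64_t call_target(uint64_t insn_addr, const unsigned char *insn) {
    return insn_addr + 5 + (uint64_t)(int64_t)read_le32(insn + 1);
}
```

For the call at 0x400599, the displacement is 0xfffffff2 = -14, the next instruction is at 0x40059e, and 0x40059e - 14 = 0x400590, which is the address the disassembly shows for plus.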

Disassembler
objdump -d sum

Useful tool for examining object code
Analyzes bit pattern of series of instructions
Produces approximate rendition of assembly code
- Reads file and disassemble the executable portions of your code and shows it to you in assembly language
- It will not have the labels of your variables and stuff, but it will be something you can more or less read
Can be run on either a.out (complete executable) or .o file
- Recommendation: the gdb/lldb debugger gives you an interactive environment where you can run your program and debug it using breakpoints and other tools.
- Lets you examine data
- This debugger also includes a disassembler.
  - Instead of showing you C code, it will show you the assembly language it disassembles.
  - This is where you get to see how your code was translated by the compiler.

Alternate Disassembly

Object

0x0400595: 
   0x53
   0x48
   0x89
   0xd3
   0xe8
   0xf2
   0xff
   0xff
   0xff
   0x48
   0x89
   0x03
   0x5b
   0xc3

Disassembly

Dump of assembler code for function sumstore:
 0x0000000000400595 <+0>: push   %rbx
 0x0000000000400596 <+1>: mov    %rdx,%rbx
 0x0000000000400599 <+4>: callq  0x400590 <plus>
 0x000000000040059e <+9>: mov    %rax,(%rbx)
 0x00000000004005a1 <+12>: pop    %rbx
 0x00000000004005a2 <+13>: retq

Within gdb Debugger
- gdb sum
- disassemble sumstore
  - Disassemble procedure
  - Will give you the disassembled code
- x/14xb sumstore
  - Examine the 14 bytes starting at sumstore

What Can be Disassembled?

% objdump -d WINWORD.EXE

WINWORD.EXE:   file format pei-i386

No symbols in "WINWORD.EXE".
Disassembly of section .text:

30001000 <.text>:
30001000:  55             push   %ebp
30001001:  8b ec          mov    %esp,%ebp
30001003:  6a ff          push   $0xffffffff
30001005:  68 90 10 00 30 push   $0x30001090
3000100a:  68 91 dc 4c 30 push   $0x304cdc91

Anything that can be interpreted as executable code
Disassembler examines bytes and reconstructs assembly source

Reverse engineering forbidden by Microsoft End User License Agreement

Can you disassemble Microsoft Word?
- Yes you can if you have access to the executable
- But you are not supposed to if you agreed to the license agreement.
May be needed when finding vulnerabilities
- Check: Reverse Engineering courses
- Disassembly process is used often here

Assembly Basics: Registers, operands, move

x86-64 Integer Registers

Pasted image 20250918124904.png|500

Can reference low-order 4 bytes (also low-order 1 & 2 bytes)
Example:
- rax is the 64-bit version of the 32-bit eax register
  - You can still use eax for 32-bit values but if you want you can use rax for longer values
- ax - 16 bits
- ah - higher order byte
- al - lower order byte
Names don't mean anything except for a couple
- rsp = stack pointer
  - The stack is a structure in memory used for passing parameters and for knowing where to return to after a procedure
  - Parameters can be passed on the stack, local variables are also stored in the stack and removed when they are no longer used.
  - Needs to be pointing to some memory that represents the top of the stack
- rbp = frame pointer
  - Points to the beginning of the local variable storage for the current procedure (the procedure you are in now)
  - Store things like local variables or temporary variables
  - Everything is always at a constant offset from the frame pointer

Some History: IA32 Registers

Pasted image 20250918130142.png|600

These are the 32-bit versions of the registers
Note: most instructions can use any register, seen later
Where do names come from:
- ax: accumulate
- cx: counter
- dx: data
- bx: base (for an array)
- ...

Moving Data

Moving Data
- movq <Source>, <Dest>:
  - Moves data
  - Very general instruction
  - Encoding is different for the different kinds of things that it can do
Operand Types (expression after the name of the instruction)
- Immediate: Constant integer data
  - Example: $0x400, $-533
  - Like C constant, but prefixed with ‘$’
  - Encoded with 1, 2, or 4 bytes
    - If a constant doesn't fit in 4 bytes, it cannot be encoded as an ordinary immediate; the assembler needs a special form (movabsq into a register) or an extra instruction.
- Register: One of 16 integer registers
  - Example: %rax, %r13
  - But %rsp reserved for special use
    - You can use it as a source or destination, but it comes with side effects. You have to be aware of what exactly %rsp is doing
  - Others have special uses for particular instructions
- Memory: 8 consecutive bytes of memory at address given by register
  - Simplest example: (%rax)
    - Instead of using the value of rax itself, use the memory it points to
  - Various other “address modes”
  - Means: "use this register as an address"

Pasted image 20250918133345.png|150

`movq` Operand Combinations

Pasted image 20250918133948.png|600

Cannot do memory-memory transfer with a single instruction

Source

Immediate
Register
Memory

Destination

Register
Memory
Versatile useful instructions
Basically just moving data into memory and out of memory
We have a limited number of registers so we need a tool to interact with memory efficiently
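
Each legal movq combination has a rough C analog. The sketch below lists them; the variable names and the register choices in the comments are assumptions made for illustration (a compiler is free to pick other registers).

```c
/* Each statement is a C analog of one movq operand combination. */
long mov_demo(long *p) {
    long temp1, temp2;
    temp1 = 0x4;      /* immediate -> register:  movq $0x4, %rax    */
    *p = -147;        /* immediate -> memory:    movq $-147, (%rax) */
    temp2 = temp1;    /* register  -> register:  movq %rax, %rdx    */
    *p = temp2;       /* register  -> memory:    movq %rax, (%rdx)  */
    temp1 = *p;       /* memory    -> register:  movq (%rax), %rdx  */
    return temp1;
}

/* Memory -> memory is not a single instruction, so *dst = *src
   conceptually goes through a register: one load, then one store. */
void copy_long(long *dst, const long *src) {
    long tmp = *src;  /* load  */
    *dst = tmp;       /* store */
}
```

`mov_demo` ends with both `*p` and its return value equal to 4.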

Simple Memory Addressing Modes

Normal (R) Mem[Reg[R]]
- Register R specifies memory address
- Aha! Pointer dereferencing in C
```
movq (%rcx),%rax
```
  - Move what %rcx points to into %rax
  - A "loading" instruction
Displacement D(R) Mem[Reg[R]+D]
- Register R specifies start of memory region
- Constant displacement D specifies offset
- Means: "memory at register + constant"
- Example: a displacement of 8
  - 8 bytes beyond rbp
  - "Give me the second element of that array" (of 8-byte elements)
  - You can also use it if rbp points to a struct and you want the field 8 bytes into the struct
  - In general it is used for offsets in structs or objects
```
movq 8(%rbp),%rdx
```
rbp points to local variables
Laid out in order
This is saying: get the second local variable and put it into rdx
Can be read as: Address = 8 + %rbp, load from that address into %rdx
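
In C terms, the two simple modes correspond to a plain dereference and a dereference at a fixed byte offset. A small sketch with hypothetical function names, assuming 8-byte longs for the "second element" reading:

```c
/* (R): plain pointer dereference, like movq (%rcx),%rax. */
long load_normal(const long *p) {
    return *p;
}

/* 8(R): the long stored 8 bytes past p, like movq 8(%rbp),%rdx.
   With 8-byte longs (true on x86-64 Linux) this is simply p[1]. */
long load_disp8(const long *p) {
    const char *bytes = (const char *)p;
    return *(const long *)(bytes + 8);
}
```

For an array `{10, 20, 30}` of 8-byte longs, `load_normal` gives 10 and `load_disp8` gives 20.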

Example of Simple Addressing Modes

C Code

void swap
   (long *xp, long *yp) 
{
  long t0 = *xp;
  long t1 = *yp;
  *xp = t1;
  *yp = t0;
}

Assembly

swap:
   movq    (%rdi), %rax
   movq    (%rsi), %rdx
   movq    %rdx, (%rdi)
   movq    %rax, (%rsi)
   ret

There is a one to one correspondence between these statements and the ones in the C code

Understanding `swap()`

void swap
   (long *xp, long *yp) 
{
  long t0 = *xp;
  long t1 = *yp;
  *xp = t1;
  *yp = t0;
}

Pasted image 20250925130015.png|300

Register	Value
%rdi	    xp
%rsi	    yp
%rax	    t0
%rdx	    t1

swap:
   movq    (%rdi), %rax  # t0 = *xp  
   movq    (%rsi), %rdx  # t1 = *yp
   movq    %rdx, (%rdi)  # *xp = t1
   movq    %rax, (%rsi)  # *yp = t0
   ret

This should be very simple
It is the same thing we are doing in C, just in a different and particular language
In this case there is a one-to-one correspondence in the number of statements
In assembly you have to explicitly indicate ret to return
What if we swapped the first two movq statements? (make the first happen second, and the second happen first)
- This would not change the result, the order was different but at the end of the procedure you can't tell the difference
- This means we can do them in parallel, both of them at the same time
- The processor can figure this out and schedule these instructions to be processed at the same time.
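
The reordering claim is easy to check in C. The second function below is a hypothetical variant (not from the lecture) with the two loads swapped; since neither load depends on the other, the result is the same.

```c
void swap(long *xp, long *yp) {
    long t0 = *xp;
    long t1 = *yp;
    *xp = t1;
    *yp = t0;
}

/* Same function with the two loads in the opposite order.
   Neither load depends on the other, so the outcome is identical. */
void swap_loads_reordered(long *xp, long *yp) {
    long t1 = *yp;
    long t0 = *xp;
    *xp = t1;
    *yp = t0;
}
```

Starting from a = 1 and b = 2, either version leaves a = 2 and b = 1.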

...

Complete Memory Addressing Modes

Most General Form
D(Rb,Ri,S) Mem[Reg[Rb]+S*Reg[Ri]+ D]
- D: Constant “displacement” 1, 2, or 4 bytes
- Rb: Base register: Any of 16 integer registers
- Ri: Index register: Any, except for %rsp
- S: Scale: 1, 2, 4, or 8 (why these numbers?)
Special Cases
- Special cases where we omit something (the scale, etc.)
  (Rb,Ri) Mem[Reg[Rb]+Reg[Ri]]
  D(Rb,Ri) Mem[Reg[Rb]+Reg[Ri]+D]
  (Rb,Ri,S) Mem[Reg[Rb]+S*Reg[Ri]]
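
A likely answer to "why these numbers?" is that 1, 2, 4, and 8 are the sizes of the basic data types, so (Rb,Ri,S) is exactly array indexing: base address plus element size times index. A minimal sketch, with a hypothetical helper name:

```c
#include <stddef.h>

/* Address of element i in an array of elem_size-byte items:
   base + elem_size * i, the (Rb,Ri,S) pattern with S = elem_size. */
const void *element_address(const void *base, size_t i, size_t elem_size) {
    return (const char *)base + elem_size * i;
}
```

For a `long l[4]`, `element_address(l, 2, sizeof l[0])` is the same address as `&l[2]`; the same holds for char, short, and int arrays with scales 1, 2, and 4 on x86-64.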

Address Computation Examples

%rdx	0xf000
%rcx	0x0100
Pasted image 20250918135311.png|450

4 bytes each
0x80(,%rdx,2)
- Omitting base register
  ...
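
The address arithmetic can be checked with a few lines of C. The register values and `0x80(,%rdx,2)` come from the notes; the other three expressions in the usage note are illustrative choices that exercise each form, and the helper name `addr` is hypothetical.

```c
#include <stdint.h>

/* D(Rb,Ri,S) -> Reg[Rb] + S*Reg[Ri] + D.
   Pass 0 for an omitted register and 1 for an omitted scale. */
uint64_t addr(uint64_t d, uint64_t rb, uint64_t ri, uint64_t s) {
    return rb + s * ri + d;
}
```

With %rdx = 0xf000 and %rcx = 0x0100:
- 0x8(%rdx) is `addr(0x8, 0xf000, 0, 1)` = 0xf008
- (%rdx,%rcx) is `addr(0, 0xf000, 0x100, 1)` = 0xf100
- (%rdx,%rcx,4) is `addr(0, 0xf000, 0x100, 4)` = 0xf400
- 0x80(,%rdx,2) is `addr(0x80, 0, 0xf000, 2)` = 0x1e080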

Arithmetic & logical operations

Address Computation Instruction

leaq <Src>, <Dst>
- Src is address mode expression
- Set Dst to address denoted by expression
- Rather than moving the data, it computes the address
  - You can use it before you actually store something into an address
  - Then puts this address into the destination register
Uses
- Computing addresses without a memory reference
  - E.g., translation of p = &x[i];
- Computing arithmetic expressions of the form x + k * y
  - k = 1, 2, 4, or 8
  - You can have that displacement if you like

Example

long m12(long x)
{
  return x*12;
}

Note: integer return values always go in %rax

Converted to ASM by compiler:

leaq (%rdi,%rdi,2), %rax  # t <- x+x*2
salq $2, %rax             # return t<<2

Example: combine with shift left, easily lets you multiply by 12
- leaq computes rdi * 3 into rax; the shift left by 2 then multiplies by 4, giving x * 12
Was designed to compute addresses, not really arithmetic
- Condition codes remain whatever they were before
- This is useful if we want to remember them from a previous instruction
- A little bit more efficient: a leaq placed between a comparison and the branch that uses it gives the comparison time to complete before you ask it what it did.
  - This is called instruction scheduling
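
Both uses of leaq can be written out in C. This is a sketch with hypothetical function names; the register comments show one plausible translation, assuming 8-byte longs for the array case.

```c
/* Address computation without touching memory: p = &x[i].
   A plausible translation is leaq (%rdi,%rsi,8), %rax. */
long *elem_ptr(long *x, long i) {
    return &x[i];
}

/* The x*12 sequence, one C statement per instruction. */
long m12_lea_shift(long x) {
    long t = x + x * 2;   /* leaq (%rdi,%rdi,2), %rax : t = 3x      */
    return t * 4;         /* salq $2, %rax            : t << 2 = 12x */
}
```

`m12_lea_shift(5)` returns 60, the same as `5 * 12`, without ever using a multiply instruction.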

Some Arithmetic Operations

Two Operand Instructions:

  Format	Computation
	addq	Src,Dest	Dest = Dest + Src
	subq	Src,Dest	Dest = Dest − Src
	imulq	Src,Dest	Dest = Dest * Src
	salq	Src,Dest	Dest = Dest << Src	# Also called shlq
	sarq	Src,Dest	Dest = Dest >> Src	# Arithmetic (signed)
	shrq	Src,Dest	Dest = Dest >> Src	# Logical (unsigned)
	xorq	Src,Dest	Dest = Dest ^ Src
	andq	Src,Dest	Dest = Dest & Src
	orq	Src,Dest	Dest = Dest | Src

Watch out for argument order!
- Example: subq Src,Dest
  - You are subtracting Src from Dest, not the other way around
- The Destination is always the thing that is getting modified.
- Sometimes operations are not commutative
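  - e.g. subtraction, shifting, etc.
No distinction between signed and unsigned int (why?)
- With two's complement, these operations produce the same bit pattern whether the operands are read as signed or unsigned (the exception is right shift, which is why both sarq and shrq exist).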
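
A short sketch of both points: addition gives the same bits for signed and unsigned, while right shift needs two versions. The function names are hypothetical, and the arithmetic shift is written out by hand on unsigned values so it does not depend on how a particular compiler shifts negative numbers.

```c
#include <stdint.h>

/* Add two 64-bit patterns as unsigned; a signed two's-complement add
   would produce exactly the same bits. */
uint64_t add_bits(uint64_t a, uint64_t b) {
    return a + b;
}

/* Logical right shift (like shrq): zero-fills from the left. */
uint64_t shift_logical(uint64_t v, unsigned n) {
    return v >> n;
}

/* Arithmetic right shift (like sarq) for n in 0..63: copies the sign bit. */
uint64_t shift_arithmetic(uint64_t v, unsigned n) {
    uint64_t shifted = v >> n;
    if (n > 0 && (v >> 63))                   /* top bit set: "negative" */
        shifted |= ~UINT64_C(0) << (64 - n);  /* fill vacated bits with 1s */
    return shifted;
}
```

Adding the bit patterns for -5 and 3 gives the bit pattern for -2. Shifting 0x8000000000000000 right by 4 gives 0x0800000000000000 logically and 0xf800000000000000 arithmetically.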

One Operand Instructions

incq	Dest	Dest = Dest + 1
decq	Dest	Dest = Dest − 1
negq	Dest	Dest = − Dest    (takes two's complement)
notq	Dest	Dest = ~Dest     (takes one's complement)

See book for more instructions
Kind of redundant instructions, but it is a common idiom in coding to add 1 to things.
A nice but unnecessary thing to have
Make programs simpler
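
The one-operand instructions map onto very small C expressions. This sketch (hypothetical function names, unsigned 64-bit values standing in for registers) also shows the identity behind negq: negating is the same as flipping every bit and adding 1.

```c
#include <stdint.h>

uint64_t inc64(uint64_t d) { return d + 1; }    /* incq */
uint64_t dec64(uint64_t d) { return d - 1; }    /* decq */
uint64_t not64(uint64_t d) { return ~d; }       /* notq: one's complement */
uint64_t neg64(uint64_t d) { return ~d + 1; }   /* negq: two's complement */
```

`neg64(5)` produces the same bit pattern as -5, and applying `neg64` twice gives back the original value.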

Arithmetic Expression Example

C code:

long arith
(long x, long y, long z)
{
  long t1 = x+y;
  long t2 = z+t1;
  long t3 = x+4;
  long t4 = y * 48;
  long t5 = t3 + t4;
  long rval = t2 * t5;
  return rval;
}

Assembly code:

arith:
  leaq    (%rdi,%rsi), %rax      # t1
  addq    %rdx, %rax             # t2
  leaq    (%rsi,%rsi,2), %rdx        # (y * 3)
  salq    $4, %rdx               # t4: (y * 3) * 16
  leaq    4(%rdi,%rdx), %rcx     # t5: t3 + t4
  imulq   %rcx, %rax             # rval
  ret

Tip: you can think of leaq as having 3 operands.
t2 = result of the previous sum + %rdx
- Which register holds t2?
  - Answer: %rax
- The compiler has allocated rax to t1 and then reused it for t2. Why? Isn't that a bug?
  - No, t1 is never used again, so its register can be reused.
leaq (%rsi,%rsi,2), %rdx
- Why is it multiplying y by 3?
- First it multiplies by 3, then salq $4, %rdx shifts that result left by 4 positions, which multiplies it by 16
- So we get %rdx = y * 3 * 16 which is the same as y * 48.
Now, which register is allocated to t3?
- It is not allocated to any register at all; the compiler just knows x+4 is a value
- You do not need a register; the value is folded into the computation that uses it.
leaq 4(%rdi,%rdx), %rcx
- %rcx = x + 4 + t4
imulq %rcx, %rax
- Take t2 * t5 and put the result into rax
- Multiplication is commutative but we would like our result to go into %rax because we want to return that value.
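
To double-check the walkthrough, the sketch below puts the original C function next to a step-by-step mirror of the assembly. In `arith_steps` (a hypothetical name), the variables rdi, rsi, rdx, rax, and rcx are ordinary C longs named after the registers they imitate.

```c
/* Original C function from the notes. */
long arith(long x, long y, long z) {
    long t1 = x + y;
    long t2 = z + t1;
    long t3 = x + 4;
    long t4 = y * 48;
    long t5 = t3 + t4;
    long rval = t2 * t5;
    return rval;
}

/* One C statement per assembly instruction. */
long arith_steps(long x, long y, long z) {
    long rdi = x, rsi = y, rdx = z, rax, rcx;
    rax = rdi + rsi;        /* leaq (%rdi,%rsi), %rax    : t1         */
    rax = rax + rdx;        /* addq %rdx, %rax           : t2         */
    rdx = rsi + rsi * 2;    /* leaq (%rsi,%rsi,2), %rdx  : y * 3      */
    rdx = rdx * 16;         /* salq $4, %rdx             : t4 = y*48  */
    rcx = 4 + rdi + rdx;    /* leaq 4(%rdi,%rdx), %rcx   : t5         */
    rax = rax * rcx;        /* imulq %rcx, %rax          : rval       */
    return rax;
}
```

For (x, y, z) = (1, 2, 3) both functions return 606: t2 = 6, t5 = 5 + 96 = 101, and 6 * 101 = 606. Note that rdx is overwritten with y * 3 only after z has been used.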

Interesting Instructions

leaq: address computation
salq: shift
imulq: multiplication
- But, only used once

Understanding Arithmetic Expression Example

Register	Use(s)
`%rdi`	Argument `x`
`%rsi`	Argument `y`
`%rdx`	Argument `z`
`%rax`	`t1`, `t2`, `rval`
`%rdx`	`t4`
`%rcx`	`t5`

Machine Programming I: Summary

History of Intel processors and architectures
- Evolutionary design leads to many quirks and artifacts
C, assembly, machine code
- New forms of visible state: program counter, registers, ...
- Compiler must transform statements, expressions, procedures into low-level instruction sequences
Assembly Basics: Registers, operands, move
- The x86-64 move instructions cover wide range of data movement forms
Arithmetic
- C compiler will figure out different instruction combinations to carry out computation