CPSC 355 Lecture Notes PDF

Summary

These lecture notes cover computer architecture, including CPU architecture, the fetch-execute cycle, data representation, assembly language, C programming, and different CPU architectures (Accumulator, Load/Store, RISC, CISC).

Full Transcript

CPSC 355 Lecture Notes

GOODBYE MANZARA ❣️ ❣️ WE WILL ALWAYS REMEMBER YOU

Course objectives:
● Computer structure, i.e. architecture, specifically CPU architecture
● Learn how a computer operates
  ○ fetch-execute cycle
● Learn how data and instructions are represented internally
  ○ Signed and unsigned ints
  ○ Strings, chars
  ○ Floats
  ○ Machine instructions
● Assembly
  ○ Used for:
    ■ Embedded systems
    ■ OS kernels
    ■ Device drivers
    ■ Code generator as part of compiler
    ■ (rarely) Applications
  ○ Helps to understand:
    ■ Computer architecture
    ■ Operating system
    ■ Writing efficient programs
    ■ Connection between high level languages and machine operation
● C

High-Level Architecture

Computer system (most basic) consists of:
● CPU
● System clock
● Primary memory (RAM)
● Secondary memory (SSD, HDD, etc.)
● Peripheral input and output devices
● Bus

[Diagram: data transfer via the bus. The CPU is connected to the clock. The bus links the CPU with primary memory (two-way), secondary memory (two-way), input devices (into the bus), and output devices (out of the bus).]

CPU
● Executes instructions
● Controls transfer of data across the bus
● Usually contained on a single microprocessor chip
  ○ Intel Core i5
  ○ APM883208-X1
    ■ Applied Micro
  ○ Apple A7
    ■ Arm technology, branded with Apple name
● Three main parts
  ○ Control Unit (CU)
    ■ Directs execution of instructions
    ■ Loads an opcode from primary memory into the Instruction Register (IR)
    ■ Decodes the opcode to identify the operation
    ■ If necessary, transfers data between primary memory and registers
    ■ If necessary, directs the ALU to operate on data in registers
  ○ Arithmetic Logic Unit (ALU)
    ■ Performs arithmetic and logical operations on data in registers
    ■ e.g. Add numbers from 2 source registers, then store the result in a third register.
    ■ e.g. Bitwise logic, like AND between two source registers.
  ○ Registers
    ■ Binary storage units in the CPU
    ■ May contain:
      ○ Data
      ○ Addresses/pointers
      ○ Instructions
      ○ Status information
    ■ General-purpose registers are used by a programmer to temporarily hold data and addresses.
    ■ Lose information when power is turned off.
    ■ The Program Counter (PC) contains the address in memory of the currently executing instruction.
      ○ To progress to the next instruction, it is simply incremented.
    ■ The Status Register (SR) contains information (flags, which are bits) about the result of a previous instruction
      ○ e.g. overflow, carry

System Clock
● Generates a clock signal to synchronise the CPU and other clocked devices
  ○ Is a square wave at a particular frequency
    ■ shape is like a wave but with only 90 degree angles
  ○ Clock rate (speed) in GHz
    ■ GHz = billion cycles per second

Primary Memory
● Random access memory. A better term would be “direct access memory”, or “non-sequential access memory”. The word “random” is a historical term that does not really make sense.
● Any byte in memory can be accessed directly with its address
● volatile: data disappears when power is lost
● Stores:
  ○ Program instructions
  ○ Program data (variables)
● Consists of a sequence of addressable 1-byte memory locations.
  ○ There are architectures with more than 1 byte per location, but these are rare
● Architectures:
  ○ von Neumann architecture: RAM contains both data and programs (instructions).
    ■ Will be using this on our ARM servers
  ○ Harvard architecture: uses separate memories for data and programs.

Bus
● Copper wires on motherboard
● Set of parallel data/signal lines
● Transfers information between computer components
● Often subdivided into:
  ○ address bus
    ■ CPU to RAM
    ■ Specifies memory location in RAM (a positive integer)
    ■ Or sometimes, a memory-mapped I/O device.
    ■ Common sizes: 32 or 64 bits
  ○ data bus
    ■ Communication between CPU and RAM
    ■ Common sizes: 32 or 64 bits
  ○ control bus
    ■ Communication between CPU and RAM
    ■ Used to control or monitor devices connected to the bus
    ■ E.g. read/write signal for RAM
    ■ Handled at the level of hardware - we do not control this, and will not spend much time on it
  ○ An expansion bus might be connected to the computer’s local bus
    ■ For attaching input and output devices
    ■ Bus standards:
      ○ USB
      ○ SCSI (“scuzzy”, older standard)
      ○ PCIe

[Diagram: CPU -address-> Primary Memory]

Secondary Memory
● Holds a computer’s file system
● Non-volatile memory
  ○ Persists throughout power cycle
● Usually embodied on a HDD or SSD

Peripheral I/O Devices
● Allow communication between computer and external environment
● Input examples:
  ○ Microphone (a transducer; uses an ADC (analog-to-digital converter) to convert sound waves into computer-readable information)
  ○ Mouse, keyboard, trackball, joystick, etc.
● Output:
  ○ Monitor, printer, speakers, etc.
● I/O devices:
  ○ HDD
    ■ Need to learn system io
  ○ Modem
  ○ Connections to networks

Basic CPU Architectures

Accumulator Machines

[Diagram: the CPU contains an ALU and an accumulator register (ACC). The ALU takes its inputs from the ACC and the data bus and writes its output back to the ACC. The CPU sends addresses to primary memory over the address bus and exchanges data with primary memory over the data bus.]

● Operands for an instruction come from the accumulator register (ACC) and from a single location in RAM
● ALU results are stored in the ACC, overwriting the old results
● The ACC can be loaded from or stored to the RAM

Load/Store Machines

[Diagram: the CPU contains an ALU and a register file. The ALU takes two registers as input and writes its output to a register.]

● Note: The “register file” is not a file, but a set of registers. Could be thousands.
● Register file can be saved to or loaded from RAM.
● Only load/store operations can access RAM
● Other instructions operate on specified registers in the register file, not on RAM
  ○ Physical distance to registers is much smaller than to RAM
● Typical program sequence:
  ○ Load registers from memory (Load instructions)
  ○ Execute an instruction using two source registers, putting the result into a destination register.
  ○ Store the result into memory (Store instruction)

RISC and CISC Architectures

RISC: Reduced Instruction Set Computer
● Uses only simple instructions that can be executed in one machine cycle.
● Advantage: Enables faster clock rates, thus faster overall execution
● Disadvantage: Programs require more instructions and become larger without access to complex instructions
  ○ Ex. Original SPARC architecture had no multiply instruction.
    ■ Done using repetitive add-shift operations.
● Machine instructions are always the same size
  ○ Makes decoding simpler and faster
  ○ ARMv8 instructions are always 32 bits wide

CISC: Complex Instruction Set Computer
● May have instructions that take many cycles to execute
  ○ Slows down clock rate but makes programming easier
  ○ e.g. Intel Core 2
    ■ add: 1 cycle
    ■ subtract: 1 cycle
    ■ multiply: 5 cycles
    ■ division: 40 cycles
● Machine instructions can vary in length, and may be followed by “immediate” data.
  ○ Makes decoding difficult and slow
  ○ ex: Intel x86
    ■ Can be 1-15 bytes

Instruction Cycle
● Fetch-execute or fetch-decode-execute cycle
● CPU executes each instruction in a series of small steps:
  ○ Fetch the next instruction from memory into the instruction register (IR)
    ■ The program counter register (PC) stores the address
  ○ Increment the contents of the PC to point to the next instruction
  ○ Decode the instruction
  ○ If the instruction uses an operand in RAM, calculate the address
  ○ Fetch the operand
  ○ Repeat previous two steps if necessary
  ○ Execute the instruction
  ○ Calculate address of RAM to store result in, if necessary.
  ○ Store result

Assembly Language Programs
● Consist of a series of statements, each of which corresponds to a machine instruction.
● ARMv8 example:
        add     x20, x20, x21
  ○ corresponds to:
    ■ 1000 1011 0001 0101 0000 0010 1001 0100
    ■ 0x8b150294
● Each statement consists of an opcode, and a variable number of operands
  ○ “add x20, x20, x21” → “opcode operand,operand,operand”
● Instructions are stored sequentially in RAM, therefore each instruction has a unique address.
● Optionally, a label can prefix any statement.
  ○ Form: “label:    statement”
  ○ Is a symbol whose value is the address of the machine instruction.
  ○ May be used as a target for a branch instruction
● Pseudo-ops (assembler directives) do not generate machine instructions, but give the assembler extra information.
  ○ Form: “.” followed by the directive name
  ○ e.g. “.global start”
● Comments may follow a statement. For example, after a // delimiter.
● Use indents to format label, opcode, operands, comments
  ○ Use an editor like emacs to automate this

Assemblers
● Translate assembly source code into machine language
● In this course, we are using the GNU assembler, “as”.
  ○ Invoked for us by gcc, the GNU compiler driver
● To assemble source code:
  ○ gcc myprog.s -o myprog
  ○ gcc assumes that any file that ends in “.s” contains assembly language source code.

Macro Preprocessors
● Many assemblers support macros
  ○ Allows you to define a piece of text with a macro name
  ○ Parameters can optionally be specified
  ○ Macros: little bits of text that expand into larger commands
● This text will be substituted inline wherever invoked, called macro-expansion
● Improves readability
● Unfortunately gcc has limited support for macros
  ○ Therefore, we will use another macro preprocessor called m4 before invoking gcc
● m4 is built into unix and is a standard unix command

m4 e.g.
        define(coef, x23)       // defines 2 macros. Constants named coef and z_r
        define(z_r, x18)        // Named registers coef and z_r.
                                // “z_r” means store z in a register instead of memory (conventional name).
        …
        add     x19, z_r, coef  // store the sum of...
        // expands to:
        …
        add     x19, x18, x23   // (x19 is the name of a register, not a macro, so it is not touched)

Macro preprocessors (cont.)
● General procedure:
  ○ Put your source code containing macros into a file ending in .asm
  ○ Invoke m4, redirecting output to a file ending in .s
    ■ e.g. m4 myprog.asm > myprog.s
  ○ Run gcc on the output file

Readings and Exercises:
● P & H: Chapter 1
● Most of the text we’re dealing with is 2-3, especially 2.7.

ARMv8 Architecture

Introduction
● This course uses the Gigabyte R152-P33 servers (two of them)
● CPU is the Ampere Altra Q64-22 64-bit processor
  ○ It is an implementation of the ARMv8-A specification. Licensed from ARM Holdings, PLC.
  ○ In the name, -A: application profile. -M: microcontroller profile.
  ○ ARM: Advanced RISC Machine (originally Acorn RISC Machine)
● Installed OS is Linux Fedora 40.
  ○ Includes up-to-date versions of gcc, as, gdb, and m4
● Access over ssh with addresses:
  ○ csa1.uc.ucalgary.ca (computer science arm 1)
  ○ csa2.uc.ucalgary.ca
  ○ Load balancer: csarm.ucalgary.ca
    ■ Puts you on whichever is less active

Architecture
● The ARMv8-A architecture:
  ○ is a RISC
  ○ is a Load/Store machine
    ■ register file contains 31 64-bit-wide registers
    ■ Most instructions manipulate 64-bit data stored in these registers. When we can get away with just 32-bit data, we only use the right half of the registers.
  ○ uses von-Neumann architecture for RAM
  ○ has two execution states:
    ■ AArch64:
      ○ Uses the instruction set called A64 with 64 bit registers.
      ○ Used exclusively in this course
    ■ AArch32:
      ○ Runs code developed on older versions of the ARM architecture.
      ○ Uses A32 or T32 instruction sets.
      ○ Backwards compatibility with the older 32-bit ARM and THUMB instruction sets
  ○ has four exception levels:
    ■ EL0: Application
      ○ Normal user applications with limited privileges
      ○ Restricted access to a limited set of instructions and registers, as well as certain parts of memory
        ■ Prevents interfering with the OS
      ○ Most common
    ■ EL1: OS Kernel
      ○ For the OS kernel
        ■ Privileged access to instructions, registers, and memory.
        ■ Accessed indirectly by user programs using system calls
    ■ EL2: Hypervisor
      ○ For a Hypervisor:
        ■ above “supervisor” EL1
        ■ Supports virtualization, allowing computer to host multiple guest operating systems on their own virtual machines.
    ■ EL3: Secure Monitor
      ○ Low level firmware
        ■ Includes the Secure Monitor

Registers
● AArch64 has 31 64-bit-wide general purpose registers,
  ○ Numbered 0-30
  ○ When using all 64 bits, use “x” or “X” before the number.
    ■ Stands for “extended”
    ■ e.g. x0, x1, …, x30
  ○ When using the low-order 32 bits of the register, use “w” or “W”
    ■ stands for “word”
    ■ e.g. w2, w29
● Many registers have special uses:
  ○ x0 - x7:    Used to pass arguments into a procedure, then return results.
  ○ x8:         Indirect result location register
  ○ x9-x15:     Temporary registers
  ○ x16-x17:    Intra-procedure-call temporary registers
    ■ In documentation, often called IP0 and IP1.
  ○ x18:        Platform register (for some operating systems)
  ○ x19-x28:    Free registers (use for work we are doing)
  ○ x29:        Frame pointer (FP) register
  ○ x30:        Procedure link register (LR)
● x19-x28 are callee-saved registers, which makes them safe to use.
  ○ Their values are preserved by any function you call.
● Special purpose registers:
  ○ Stack Pointer:
    ■ Letters SP in code.
    ■ 64 bit wide register used in A64 code.
    ■ WSP: the 32-bit view of SP, ignore for now
    ■ Points at the top of the run-time stack
  ○ Zero Register:
    ■ XZR: 64 bits wide
    ■ WZR: 32 bits wide
    ■ Gives 0 value when read from
    ■ Discards value when written to
  ○ Program Counter
    ■ PC: 64 bits wide
    ■ Holds the address of the currently executing instruction
    ■ Can’t access directly as a named register
    ■ Is changed indirectly by branch instructions
    ■ Is used implicitly by pc-relative load/store instructions
    ■ Can be accessed in gdb with $pc
  ○ Floating Point Registers
    ■ There are 32 128-bit-wide floating-point registers.
    ■ Discussed at the end of the course
  ○ System registers (about 600)
    ■ Only accessible at EL1 (OS kernel)
    ■ Used in OS kernel code

A64 Assembly Language
● Consists of statements with one opcode and 0 to 4 operands.
  ○ The allowed operands depend on the particular instruction
  ○ In general:
    ■ First operand: destination register
    ■ The other three: source registers
  ○ E.g.   add     x19, x20, x21
    ■ x19 is the destination, the others are source 1 and 2
● An immediate value (a constant) may be used as the final source operand for some instructions.
  ○ E.g.   add     x19, x20, 42
  ○ A # symbol can prefix the immediate value, but is optional when using gcc.
  ○ Constant range depends on the instruction, as the machine instruction determines the number of available bits.
  ○ Immediates are assumed to be decimal numbers unless prefixed:
    ■ Hexadecimal: 0x
    ■ Octal numbers: 0
      ○ e.g. 0777
    ■ Binary numbers: 0b
      ○ e.g. 0b101
● Some instructions are aliases of other instructions
  ○ e.g.   mov     x29, sp
    ■ copy sp into x29
    ■ is an alias for   add     x29, sp, 0
      ○ x29 = sp + 0
● Commonly used instructions
  ○ Note: Move instructions are more like copy instructions, because they do not affect the source register.
  ○ Note: a W register is the low-order half of a 64-bit register, an X register is all 64 bits
  ○ Move immediate (32-bit)
    ■ Form:   mov Wd, #imm32
      ○ Wd: general purpose register
      ○ #imm32: immediate 32-bit value
        ■ -2^31 to 2^32-1
    ■ e.g.
        mov w20, -237
  ○ Move immediate (64-bit)
    ■ Form: mov Xd, #imm64
      ○ Xd: destination register
      ○ #imm64: 64 bit immediate value, -2^63 to 2^64-1
    ■ e.g. mov x21, 0xFFFE
  ○ Move register (32-bit)
    ■ Form: mov Wd, Wm
      ○ destination, source registers
    ■ Alias for orr Wd, wzr, Wm
  ○ Move register (64-bit)
    ■ e.g. mov x22, x20
    ■ Note that you cannot mix 32 and 64 bit registers in one instruction.
  ○ Branch and Link (bl): calls a function
    ■ Can be a library or your own function
      ○ e.g. printf
    ■ Form: bl label
    ■ e.g. bl printf
    ■ Arguments are put into x0-x7 before the call
    ■ Return value is in x0

Basic Program Structure
● The main routine:
                .global main
        main:   stp     x29, x30, [sp, -16]!
                mov     x29, sp                 // the lines above save the state of the calling code
                ...
                ldp     x29, x30, [sp], 16      // restores state
                ret
  Memorize this!

.global main
● makes the label “main” visible to the linker
● main() routine is where execution always starts

…[sp, -16]!
● Allocates 16 bytes in stack memory (in RAM)
● Does so by pre-incrementing the SP register by -16

stp x29, x30, …
● Stores the contents of the pair of registers to the stack
  ○ x29: frame pointer (FP)
  ○ x30: link register (LR)
  ○ SP points to the location in RAM where we write to
● Saves the state of the registers used by the calling code
● x29 and x30 are both 8 bytes long, hence allocating 16 bytes of memory

mov x29, sp
● Updates FP to the current SP
● FP may be used as a base address in the routine

ldp x29, x30, …
● Loads the pair of registers from RAM
  ○ SP points to the location in RAM where we read from
● Restores the state of the FP and LR registers

…[sp], 16
● Deallocates 16 bytes of memory
● Does so by post-incrementing SP by +16

ret
● Returns control to calling code (in OS)
● Uses address in LR

Basic Arithmetic Instructions

Addition:
● Uses 1 destination, 2 source operands
● Register (64 or 32)
  ○ e.g. add x19, x20, x21 // x19=x20+x21
● Immediate (64 or 32)
  ○ e.g.
        add w20, w20, 2 // w20=w20+2

Subtraction:
● Uses 1 destination, 2 source operands
● Register (32 or 64 bit):
  ○ sub x0, x1, x2 // x0 = x1 - x2
● Immediate (...)

Multiplication:
● 1 destination and 2 or 3 source registers
● No immediates allowed
● Form (32 bit): mul Wd, Wn, Wm // Wd = Wn * Wm
  ○ Alias for: madd Wd, Wn, Wm, wzr
    ■ wzr → zero register
● Form (64-bit):
  ○ e.g. mul x19, x20, x20 // square number

Multiply-Add
● Form (32-bit):
  ○ madd    Wd, Wn, Wm, Wa
    ■ Wd = Wa + (Wn * Wm)
  ○ e.g. madd    w20, w21, w22, w23
    ■ w20 = w23 + (w21*w22)
● Form (64 bit):
  ○ e.g. madd    x20, x0, x1, x20
    ■ x20 = x20 + (x0 * x1)

Multiply-Subtract
● Form (32-bit):
  ○ msub    Wd, Wn, Wm, Wa
    ■ Wd = Wa - (Wn * Wm)

Multiply-Negate
● Form (32-bit):
  ○ mneg    Wd, Wn, Wm
    ■ Wd = - (Wn * Wm)

Division
● No immediates allowed
● Signed form (32 or 64 bit):
  ○ sdiv    Wd, Wn, Wm
    ■ signed divide
    ■ operands are signed integers
    ■ Wd = Wn / Wm
● Unsigned form (32 or 64 bit)
  ○ udiv
● These instructions do integer division; remainder is discarded
  ○ The remainder (modulus) can be calculated using numerator - (quotient * denominator)
    ■ msub    destination, quotient, denominator, numerator
● Dividing by 0 does not generate an exception
  ○ Instead, writes 0 to the destination register

Other variants exist and can be found in ARM documentation on D2L

Printing to Standard Output
● Call printf
  ○ Standard function in the C library
  ○ Invoked with ≥ 1 arguments
    ■ First is the format string (usually a literal “”)
    ■ The rest correspond to the placeholders in the string
  ○ Example C code:
        …
        int x = 42;
        printf("Meaning of life = %d\n", x);
  ○ Equivalent assembly code:
        fmt:    .string "Meaning of life = %d\n"    // creates format string
                .balign 4                           // ensures instructions are aligned in memory
                .global main
        main:   …
                adrp    x0, fmt
                add     x0, x0, :lo12:fmt           // Arg 1: address of the string
                mov     w1, 42                      // Arg 2: the value to print
                bl      printf                      // function call
                …

Branch Instructions and Condition Codes
● a branch instruction transfers control to another part of the code
  ○ Like a goto in the C language
● PC register is not incremented as usual, but set to the computed address of an instruction
  ○ Corresponds to the value of its label
● An unconditional branch is always taken
  ○ Form: b label
    ■ e.g. b top
● condition flags store information about the result of an instruction
  ○ Single bit units in the CPU (AArch64 has no single status register; the flags are part of PSTATE)
    ■ Record process state (PSTATE) information
  ○ Four flags:
    ■ Z: true if result is 0
    ■ N: true if result is negative
    ■ V: true if result overflows
    ■ C: true if the result generates a carry out
  ○ Condition flags are set by instructions that end in “s” (short for set flags)
    ■ E.g. subs, adds
  ○ subs may be used to compare two registers
    ■ Eg: subs x0, x1, x2
      ○ Also sets flags
  ○ cmp is more intuitive
    ■ 64 bit:   cmp Xn, Xm
      ○ alias for: subs xzr, Xn, Xm
    ■ e.g.: cmp x1, x2
  ○ Conditional branch instructions use those flags to make a decision
    ■ If the condition holds, we take the branch
      ○ i.e. “jump” to the instruction at the specified label
    ■ otherwise, control goes to the next instruction in the program
    ■ e.g.: b.eq top // jump to a line labeled top
      ○ branches if Z is true
    ■ Form: b.cc, where cc is the condition code
    ■ For signed integers:

        Name  Meaning                  C equivalent  Flags               Complement
        eq    equal                    ==            Z == 1              ne
        ne    not equal                !=            Z == 0              eq
        gt    greater than             >             Z == 0 && N == V    le
        ge    greater than or equal    >=            N == V              lt
        lt    less than                <             N != V              ge
        le    less than or equal to    <=            Z == 1 || N != V    gt

● E.g. C:
        if (A > B) {
            // do stuff
        }
  → Assembly:
        define(a_r, x19)
        define(b_r, x20)
        define(c_r, x21)
        define(d_r, x22)

                cmp     a_r, b_r
                b.le    skip
        if:     // Do stuff
                // Do stuff
        skip:   // Do other stuff

The if-else Construct
● Is formed by branching to the else part if the condition is not true
  ○ Use the logical complement
  ○ If true, the code falls through to the if part
● E.g.:
  C:
        if (a > b) {
            c = a+b;
            d = c+5;
        } else {
            c = a-b;
            d = c-5;
        }
  Assembly:
        define(a_r, x19)
        define(b_r, x20)
        define(c_r, x21)
        define(d_r, x22)

                cmp     a_r, b_r
                b.le    else
                add     c_r, a_r, b_r
                add     d_r, c_r, 5
                b       next
        else:
                sub     c_r, a_r, b_r
                sub     d_r, c_r, 5
        next:   …

Introduction to the GDB Debugger
● to start a program under debugger control:
  ○ gdb followed by the name of the executable
● to set a breakpoint:
  ○ b followed by the location (e.g. a label)
● to run program:
  ○ r
  ○ stops at the first breakpoint
● use c to continue to the next breakpoint
● use si to single step through program
● use ni to step over the next instruction
  ○ for example, bl printf is a function call, so rather than stepping into it, ni proceeds through all of the instructions that are part of that function
● display/i $pc to automatically show the current instruction
  ○ displays the instruction at the program counter
● p $ followed by a register name to print a register
  ○ can append a format character:
    ■ signed decimal: p/d
    ■ hexadecimal: p/x
    ■ binary: p/t (letter b is used somewhere else)
● use q to quit gdb

Binary Numbers and Integer Representations

Binary numbers
● Base 2 numbers
● Use binary digits 0 and 1
● Easy to encode on a computer, since only two states need to be distinguished
  ○ Using voltages:
    ■ 0 = 0 V
    ■ 1 = 3.3 V on modern machines, used to be 5 V on older machines
  ○ Using physical media, typically 0 represents default/unaltered/off state, whereas 1 represents acted upon/altered/on state
● An n-bit size register can hold 2^n bit patterns
  ○ 4 bit register can hold 16 distinct bit patterns

Unsigned Integers
● Encoded using simple binary numbers
● Range: 0 to 2^n - 1, where n is the number of bits

Signed Integers
● Most commonly encoded using the two’s complement representation
● Range: -2^(n-1) to 2^(n-1) - 1
  ○ 4 bits: -8 to 7
  ○ 8 bits: -128 to 127
  ○ 32 bits: -2,147,483,648 to 2,147,483,647
  ○ 64 bits: -2^63 to 2^63 - 1
● Negating a number is done by:
  ○ taking the one’s complement
    ■ flip 0s to 1s and 1s to 0s
  ○ adding 1 to the result
● Find the bit pattern for -5 in a 4-bit register
  ○ 5 = 0101
  ○ → 1010 → 1011 = -5
  ○ In the other direction (subtract 1, then flip the bits):
  ○ 1011 → 1010 → 0101 = 5
● All positive numbers will have a 0 in the leftmost bit, all negatives will have 1 in the leftmost bit
  ○ Sign bit
● Sign-magnitude and one’s
complement representations are also possible
  ○ However, they are awkward to handle in hardware
  ○ Have two zeros (+0, -0)

Hexadecimal numbers
● base 16
● Use 0, 1, 2, …, 9, A, B, C, D, E, F
● Shorthand for denoting bit patterns
  ○ 0xF5A = 1111 0101 1010

Octal numbers
● base 8
● 0-7
● Each digit corresponds to a 3-bit pattern
  ○ e.g. 0756 → 111 101 110

Integer classes and subtypes
● Linux on ARMv8 in AArch64 uses the LP64 data model
  ○ Long integers and Pointers are always 64 bits long

        A64 Keyword   Size in bits   C keyword
        byte          8              char (misleading -- you can also store ints and uints in here)
        halfword      16             short int
        word          32             int
        doubleword    64             long int, void * (any kind of pointer)
        quadword      128            n/a

● In C, use the keyword unsigned to denote unsigned integers
  ○ e.g. unsigned int x;    // 32 bits
  ○ e.g. unsigned char y;   // 8 bits

Bitwise Operations

Bitwise Logical Instructions
● Manipulate bits in a register

        a   b   a AND b
        0   0   0
        0   1   0
        1   0   0
        1   1   1

● Form (64-bit):
  ○ and Xd, Xn, Xm
    ■ e.g.
                mov     x19, 0xAA           // 1010 1010
                mov     x20, 0xF0           // 1111 0000
                and     x21, x19, x20       // 1010 0000
  ○ Note that 0xF0 forms a bitmask: lets certain things through
  ○ 32 bit and immediate versions also exist with w
    ■ e.g.
                mov     w19, 0x55           // 0101 0101
                and     w19, w19, 0xF       // 0000 1111 → 0000 0101
● ands sets or clears N and Z flags according to the result. V and C are always clear.
  ○ e.g. test if bit 3 is set in x20
    ■ fourth bit from the right
    ■ Is there a 0 or 1 in there?
    ■ bitmask: 1000 = 0x8
                  …
                  ands    x19, x20, 0x8
                  b.eq    bitclear          // if it equals 0, jump to bitclear
        bitset:   …
        bitclear: ...

Missing notes!!! I was away for a couple of weeks so the notes are incomplete here. Make sure to read the binary logic and memory and stack slides.

Basic Load and Store Instructions, cont.
● Store Byte
  ○ Form (32-bit only): strb Wt, addr
  ○ Stores the low-order byte in Wt to RAM
● Store Halfword
  ○ Form (32-bit only): strh Wt, addr
  ○ Stores low-order halfword (2 bytes) in Wt into RAM

Load/Store Address Modes
● Pre-indexed by immediate offset
  ○ Form: [base, #imm]!
    ■ base: X register (typically x29, the frame pointer) or SP
    ■ #imm: A 9-bit signed constant (range: -256 to 255)
  ○ base is first updated by adding immediate to it
    ■ This address is then used for load/store
  ○ Example:
        …
        add     x28, x29, 16        // x28 = frame pointer + 16
        str     w20, [x28, 8]!      // x28 = x28 + 8, then store w20 at the address in x28
        …
● Post-indexed by immediate offset
  ○ Form: [base], #imm
    ■ base: an X register, typically x29, or SP
    ■ #imm: 9-bit signed constant (range -256 to 255)
  ○ The address in base is used for load/store
    ■ Base is updated after
  ○ E.g.
        …
        add     x28, x29, 16
        str     w20, [x28], 8       // store w20 at the address in x28, then x28 = x28 + 8

Stack Variable Offset Macros
● m4 macros can be used for offsets to improve readability
● E.g.
        define(a_s, 16)     // When using stack memory, we append _s to the name
        define(b_s, 20)
        define(c_s, 24)
        define(d_s, 28)
        …
        str     w20, [x29, a_s]
        ldr     w21, [x29, b_s]
● Can also be done with assembler equates (not related to m4)
  ○ e.g.:
        a_s = 16
        b_s = 20
        …
        str     w20, [x29, a_s]
        ldr     w21, [x29, b_s]
● Register equates are useful for renaming x29 (FP) and x30 (LR, which we haven’t used much).
  ○ e.g.
        fp      .req    x29     // register equate fp to x29
        lr      .req    x30     // register equate lr to x30
        …
        str     w20, [fp, a_s]

Local Variables
● In C, can be declared in a block of code.
  ○ i.e. in any construct delimited by {...}
  ○ e.g.
        int main()
        {
            int a = 5, b = 7;   // local to main
            …
            if (a < b) {
                int c;          // local to if-construct
            }
        }
● Implemented in assembly as stack variables
  ○ Those local to the function are allocated upon entering the function, as shown previously
  ○ Those local to other blocks of code are:
    ■ Allocated when entering the block
      ○ Done by decrementing the SP to allocate memory on the top of the stack
      ○ Are read or written using FP plus a negative offset
    ■ Deallocated when leaving the block
      ○ Done by incrementing SP
  ○ E.g. Equivalent assembly code
        a_s = 16
        b_s = 20
        c_s = -4
        …
        main:   stp     x29, x30, [sp, -(16 + 8) & -16]!
                // 16: frame record
                // 8: two variables, a and b, which are regular 4-byte integers
                // Add those numbers and negate as usual
                // & with -16 rounds the allocation to a multiple of 16 bytes
                mov     x29, sp
                mov     w19, 5              // init a to 5
                str     w19, [x29, a_s]
                mov     w20, 7              // init b to 7
                str     w20, [x29, b_s]
                cmp     w19, w20
                b.ge    next
                // start of block of code for the if construct
                add     sp, sp, -4 & -16    // alloc RAM for c
                                            // bitwise & with -16 to align properly
                mov     w21, 10             // init c to 10
                str     w21, [x29, c_s]
                …
                // end of block of code
                add     sp, sp, 16
        next:   …
● After allocating RAM for c, the stack appears as:

                Free memory (many bytes)
        SP →    Pad bytes (12B)     |
                c (4B)              |
        FP →    Frame record        |   Stack frame for main (when we are in the
                a (4B)              |   middle of the if construct after allocating
                b (4B)              |   memory for c)
                Pad bytes (8B)      |

● Note that, as shown in this visual, we always allocate in groups of 16B

Readings:
● P & H: Section 2.3, p. 166 - 168
● ARMv8 Instruction Set Overview:
  ○ Section 4.5
  ○ Section 5.3.1

Data Structures

One-Dimensional Arrays
● Must be done manually
● Store consecutive array elements of the same type in a contiguous block of memory
  ○ Block size = number of elements * element size
  ○ Address of element i is:
    ■ base address + (i * element size)
● E.g.: int ia[5];     // 1D int array with size 5
                        // Each int is 4 B long, so we need 20 B
● Like other local variables, an array is allocated in the function’s stack frame
  ○ e.g.
    C code:
        int main()
        {
            int a, b, ia[5];
            …
        }
  ○ Assembly code:
        a_size = 4              // a is 4 bytes
        b_size = 4
        ia_size = 5 * 4         // 5 elements in the array, each is 4 bytes long
        alloc = -(16 + a_size + b_size + ia_size) & -16
        // We always need 16 bytes for the frame record. Add the size of each
        // variable to that. Negate the result of the additions. Use & -16 to align to 16
        // bytes.
        dealloc = -alloc
        a_s = 16                // offset is 16 bytes for a because of 16 bytes for frame record
        b_s = 20                // offset is 20 bytes
        ia_s = 24               // offset is 24 bytes
        main:   stp     x29, x30, [sp, alloc]!
                mov     x29, sp
                …
                ldp     x29, x30, [sp], dealloc
                ret
● Memory is used as follows:

        Address         Item
                        Free memory
        SP → FP →       Frame record (16B)  |
        fp+a_s          a (4B)              |
        fp+b_s          b (4B)              |
        fp+ia_s+0       ia[0] (4B)          | -- Stack frame for main
        fp+ia_s+4       ia[1] (4B)          |
        …               …                   |
        fp+ia_s+16      ia[4] (4B)          |
                        pad bytes           |   // because of “& -16”

● Array elements are accessed using load and store instructions
  ○ E.g.: ia[2] = 13;
        define(ia_base_r, x19)  // where the array starts in memory
        define(index_r, x20)
        define(offset_r, x21)
        …
        add     ia_base_r, x29, ia_s            // calc array base address
        mov     index_r, 2                      // set index to 2
        lsl     offset_r, index_r, 2            // offset = index * 4
        mov     w22, 13
        str     w22, [ia_base_r, offset_r]      // ia[2] = 13
  ○ Can be optimised by using the LSL in the STR instruction:
        mov     index_r, 2
        mov     w22, 13
        str     w22, [ia_base_r, index_r, LSL 2]    // ia[2] = 13
  ○ If index is a w register, one must sign extend first:
        define(index_r, w20)
        …
        str     w22, [ia_base_r, index_r, SXTW 2]   // ia[2] = 13
                                                    // sign extend w

Calling random
● bl rand
  ○ Result appears in w0
  ○ Pseudo-random number generator, sequence is always the same when running the program

Multidimensional Arrays
● Use 2 or more indices to access individual array elements.
● Most languages use row major order when storing arrays in RAM
  ○ i.e. first all elements of row 0, then 1, etc.
  ○ e.g.
        int ia[2][3];
        24 byte block:

        Address     Memory
        ia+0        ia[0][0]
        ia+4        ia[0][1]
        ia+8        ia[0][2]
        ia+12       ia[1][0]
        ia+16       ia[1][1]
        ia+20       ia[1][2]

● Block size for an n-dimensional array:
  ○ $\left(\prod_{r=1}^{n} dim_r\right) \times \text{element size}$
  ○ where $dim_r$ is the number of elements in dimension r
  ○ the $\prod_{a}^{b}$ is just $\sum_{a}^{b}$ for multiplication
  ○ e.g. int ia[2][3][3];
    ■ block size: 2 * 3 * 3 * 4 bytes = 72 bytes
● The offset for a given element is:
  ○ $\left[\sum_{r=1}^{n-1}\left(index_r \times \prod_{s=r+1}^{n} dim_s\right) + index_n\right] \times \text{element size}$
    ■ where $index_r$ is the index for the element in dimension r
  ○ e.g. a 3-dimensional array would be declared with:
    ■ type array[dim1][dim2][dim3]
    ■ Its offset for element array[index1][index2][index3] is:
      ○ ((index1 * dim2 * dim3) + (index2 * dim3) + index3) * element size
  ○ e.g. for an int array ia with dim2 = 3 and dim3 = 4:
    ■ offset for ia[i][j][k] would be:
      ○ ((i*3*4)+(j*4)+k)*4
    ■ Six arithmetic operations
      ○ Makes high dimensional arrays slow
● Eg: C code for 2D array
        int main()
        {
            int ia[2][3];
            register int i, j;  // keyword in C; hint, if possible, instead of allocating in
                                // the stack, just use a register for each.
            …
            ia[i][j] = 13;
            …
        }
● Assembly:
        define(ia_base_r, x19)
        define(offset_r, w20)
        define(i_r, w21)
        define(j_r, w22)
        dim1 = 2
        dim2 = 3
        ia_size = dim1 * dim2 * 4
        alloc = -(16 + ia_size) & -16   // 16 for frame record, -16 to guarantee align.
        dealloc = -alloc
        ia_s = 16                       // ia comes immediately after frame record
        …
        main:   stp     x29, x30, [sp, alloc]!      // frame pointer, link register, stack pointer
                mov     x29, sp
                …
                add     ia_base_r, x29, ia_s        // array base address = fp + offset
                mov     w23, dim2
                mul     offset_r, i_r, w23          // offset = (i * dim2)
                add     offset_r, offset_r, j_r     // offset = (i * dim2) + j
                lsl     offset_r, offset_r, 2       // offset = ((i * dim2) + j) * 4
                mov     w24, 13
                str     w24, [ia_base_r, offset_r, sxtw]    // ia[i][j] = 13
                // could be optimised by folding the lsl into the str

Structures
● Contain fields of different types
● Allocated in a single block of memory on the stack
  ○ Fields are accessed using offsets
● Eg:
        struct rec {    // structure tag (rec short for record, but it’s unimportant)
            int a;
            char b;
            short c;
        };

        Memory          Offset
        a: 4 bytes      rec_a = 0 bytes     // rec comes from the structure tag
        b: 1 byte       rec_b = 4 bytes
        c: 2 bytes      rec_c = 5 bytes

● If the base address is in x19, the fields are accessed with:
  ○ ldr     w20, [x19, rec_a]
  ○ ldrsb   w21, [x19, rec_b]   // ldr signed byte
  ○ ldrsh   w22, [x19, rec_c]   // ldr signed halfword
● Eg:
        struct rec {            // Defines a custom data type (struct)
            int a;              // Does not allocate memory, do that in main
            char b;
            short c;
        };
        int main()
        {
            struct rec s1;      // allocates memory for local variable s1
            …
            s1.a = 42;
            s1.b = 23;
            s1.c = 13;
            …
        }
  Assembly equivalent:
        define(s1_base_r, x19)
        rec_a = 0
        rec_b = 4
        rec_c = 5
        s1_size = 7
        alloc = -(16 + s1_size) & -16
        dealloc = -alloc
        s1_s = 16               // right after frame record
        …
        main:   stp     x29, x30, [sp, alloc]!
                mov     x29, sp
                …
                add     s1_base_r, x29, s1_s        // calculate struct base address
                mov     w20, 42
                str     w20, [s1_base_r, rec_a]     // s1.a = 42
                mov     w20, 23
                strb    w20, [s1_base_r, rec_b]     // s1.b = 23
                mov     w20, 13
                strh    w20, [s1_base_r, rec_c]     // s1.c = 13

Nested Structures
● A structure may contain a field whose type is another structure
  ○ The field’s subfields are accessed using a second set of offsets.
● E.g.
  C code (custom struct to record a date)
    struct date {
        unsigned char day, month;
        unsigned short year;
    };
    struct employee {
        int id;
        struct date start;    // nested struct
    };
    int main(){
        struct employee joe;
        joe.id = 4001;
        joe.start.day = 1;
        joe.start.month = 6;
        joe.start.year = 1999;
    }

    Offset   Nested offset   Memory       Variable
    0        -               4001 (4B)    joe.id
    4        0               1 (1B)       joe.start.day
    -        1               6 (1B)       joe.start.month
    -        2               1999 (2B)    joe.start.year

  Assembly code:
    define(joe_base_r, x19)
    date_day = 0
    date_month = 1
    date_year = 2
    employee_id = 0
    employee_start = 4
    joe_size = 8
    alloc = -(16 + joe_size) & -16
    dealloc = -alloc
    joe_s = 16                           // joe comes right after the 16-byte frame record
    main:   stp   x29, x30, [sp, alloc]!
            mov   x29, sp
            …
            add   joe_base_r, x29, joe_s
            // id
            mov   w20, 4001
            str   w20, [joe_base_r, employee_id]                   // joe.id = 4001
            // day
            mov   w20, 1
            strb  w20, [joe_base_r, employee_start + date_day]     // joe.start.day = 1
            // month
            mov   w20, 6
            strb  w20, [joe_base_r, employee_start + date_month]   // joe.start.month = 6
            // year
            mov   w20, 1999
            strh  w20, [joe_base_r, employee_start + date_year]    // joe.start.year = 1999
            // Note: The assembler evaluates the expression employee_start + date_year
            // at assembly time, so the instruction carries a constant offset at runtime.

Subroutines
● Basically functions
● Subroutines allow you to repeat a calculation using varying argument values
● Two types:
    ○ Open (inline)    // the code is inserted inline wherever the subroutine is invoked
        ■ Usually implemented with a macro preprocessor
        ■ Arguments are passed in/out using registers
        ■ Efficient, since overhead of branching and returning is avoided
        ■ Suitable only for fairly short subroutines because code can grow large
    ○ Closed
        ■ Machine code for the routine appears only once in RAM
            More compact machine code
        ■ When invoked, control jumps to the first instruction of the routine
            i.e. PC loaded with the address of the first instruction
        ■ When finished, returns control to the next instruction in the calling code
            i.e. The PC is loaded with the return address.
        ■ Arguments are placed in registers or on the stack.
        ■ Slower than inline routines
● Subroutines should not change the state of the machine for the calling code.
    ○ When invoked, a subroutine should save any registers it uses on the stack.
    ○ When returning, it should restore those values
● Arguments to subroutines are considered local variables
    ○ Subroutine may change their values

Open (Inline) Subroutines
● Usually implemented using macros
● E.g. cube function
    define(comment)
    comment(cube(1 = input register, 2 = output register))
    define(cube, `mul   $2, $1, $1
            mul   $2, $1, $2')
            .global main
    main:   stp   x29, x30, [sp, -16]!
            …
            mov   x19, 8
            cube(x19, x20)               // x19 = input, x20 = output
            …
    ○ The macro body must open with the ` quotation mark (top left of keyboard) and close with the ' quotation mark
● Macro expands this to:
            …
            mov   x19, 8
            mul   x20, x19, x19
            mul   x20, x19, x20
● General form of a closed subroutine:
    label:  stp   x29, x30, [sp, alloc]!
            mov   x29, sp
            …
            ldp   x29, x30, [sp], dealloc
            ret
    ○ label: names the subroutine
    ○ alloc: number of bytes (negated) to allocate for the subroutine stack frame
        ■ SP must be quad-word aligned
        ■ Minimum of 16 bytes
            For frame record in stack frame

Subroutine Linkage
● May be invoked using the branch and link instruction: bl
    ○ form: bl   subroutine_label
    ○ Stores the return address in the link register, x30
        ■ Return address is PC+4, the next instruction after the bl instruction
● Use ret instruction to return from a subroutine to the calling code
● E.g. C code:
    int main() {
        …
        func1();
        …
    }
    void func1(){
        …
        func2();
        …
    }
    void func2(){
        …
    }
● Assembly code:
    main:   stp   x29, x30, [sp, -16]!
            mov   x29, sp
            …
            bl    func1
            …
            ldp   x29, x30, [sp], 16
            ret
    func1:  stp   x29, x30, [sp, -16]!
            mov   x29, sp
            …
            bl    func2
            …
            ldp   x29, x30, [sp], 16
            ret
    func2:  stp   x29, x30, [sp, -16]!
            mov   x29, sp
            …
            ldp   x29, x30, [sp], 16
            ret
● 16 bytes are for the frame record, which includes the previous frame pointer and link register.
● The stp instructions create a frame record in each function’s stack frame.
    ○ Stores the LR (x30) in case it is changed by a bl in the body of the function
    ○ Restored by the load pair (ldp) instruction
● The FP and stored FP values in the frame records form a linked list
    ○ The stack while func2 is executing looks like:
      [Figure: the stack while func2 is executing; surviving label: Free memory]

Saving and Restoring Registers
● A called function must save/restore the state of the calling code
    ○ If it uses any registers in x19 - x28, it must save their values to the stack at the beginning of the function
        ■ called “callee-saved registers”
    ○ Function must restore the data in these registers before returning
● E.g.
    x19_size = 8
    alloc = -(16 + x19_size) & -16
    dealloc = -alloc
    x19_save = 16                        // offset
    func2:  stp   x29, x30, [sp, alloc]!
            mov   x29, sp
            str   x19, [x29, x19_save]
            …
            mov   x19, 13
            …
            ldr   x19, [x29, x19_save]
            ldp   x29, x30, [sp], dealloc
            ret
● Note that the callee can also use registers x9 - x15
    ○ By convention, these registers are not saved/restored by the called function
    ○ Only safe to use in calling code between function calls
    ○ The calling code can save these registers to the stack, if it is necessary to preserve their values over a function call
        ■ “caller-saved registers”

Arguments to Subroutines
● 8 or fewer arguments can be passed into a function using the registers x0 - x7
    ○ ints, short ints, and chars use w0 - w7
    ○ long ints use x0 - x7
● E.g. C code
    void sum(int a, int b)
    {
        register int i;
        i = a + b;
        …
    }
    int main(){
        sum(3,4);
        …
    }
● Assembly code
    define(i_r, w9)
    sum:    stp   x29, x30, [sp, -16]!
            mov   x29, sp
            add   i_r, w0, w1            ← w0 is first arg, w1 is second arg
            ldp   x29, x30, [sp], 16
            ret

multiplication 5*3: M = 5, T = 0, N = 3
(each iteration shows two rows: after adding M to T, which happens only when the low bit of N is 1, and then after shifting T:N right by one bit)
    iteration   M      T      N
    0           0101   0000   0011
    1           0101   0101   0011
                0101   0010   1001
    2           0101   0111   1001
                0101   0011   1100
    3           0101   0011   1100
                0101   0001   1110
    4           0101   0001   1110
                0101   0000   1111
last step: 00001111 = 15 = 5*3 👍

M = 2, T = 0, N = -5
    iteration   M      T      N
    0           0010   0000   1011
    1           0010   0010   1011
                0010   0001   0101
    2           0010   0011   0101
                0010   0001   1010
    3           0010   0001   1010
                0010   0000   1101
    4           0010   0010   1101
                0010   0001   0110
Since the original multiplier is negative, subtract the multiplicand from the high half of the product, i.e. T = T - M:
    0001 - 0010 = 0001 + 1110 = 1111
result: 11110110 = -00001010 = -10

    main:   stp   x29, x30, [sp, -16]!
            mov   x29, sp
            mov   w0, 3                  // set up first arg
            mov   w1, 4                  // set up second arg
            bl    sum
            …
            ldp   x29, x30, [sp], 16
            ret
    // The subroutine (sum) regards x0-x7 as local variables

Pointer Arguments
● In calling code, the address of a variable is passed to the subroutine
    ○ Implies that the variable must be in RAM, not a register
● The called subroutine dereferences the address, to manipulate the variable being pointed to
    ○ Usually with a ldr or str instruction
● E.g. C code
    int main(){
        int a = 5, b = 7;
        swap(&a, &b);                    // “address of” operator
        …
    }
    void swap(int *x, int *y){           // x and y are pointers to some integer
        register int temp;
        temp = *x;                       // de-referencing
        *x = *y;
        *y = temp;
    }
● Assembly code:
    a_size = 4
    b_size = 4
    alloc = -(16 + a_size + b_size) & -16
    dealloc = -alloc
    a_s = 16
    b_s = 20
    define(temp_r, w9)
    …
    main:   stp   x29, x30, [sp, alloc]!
            mov   x29, sp
            mov   w19, 5                 // init a = 5
            str   w19, [x29, a_s]
            mov   w20, 7                 // init b = 7
            str   w20, [x29, b_s]
            add   x0, x29, a_s           // set up first arg (address of a)
            add   x1, x29, b_s           // set up second arg (address of b)
            bl    swap
            …
    swap:   stp   x29, x30, [sp, -16]!
            mov   x29, sp
            ldr   temp_r, [x0]           // temp = *x
            ldr   w10, [x1]              // w10 = *y
            str   w10, [x0]              // *x = w10
            str   temp_r, [x1]           // *y = temp
            ldp   x29, x30, [sp], 16
            ret

Returning Integers
● A function returns:
    ○ long ints in x0
    ○ ints, short ints, chars in w0
● E.g. Cube function in C
    int cube(int x);                     // function prototype
    int main(){
        register int result;
        result = cube(3);
    }
    int cube(int x){
        return x * x * x;
    }
● Assembly code:
    define(result_r, w19)
    main:   stp   x29, x30, [sp, -16]!
            mov   x29, sp
            mov   w0, 3                  // first arg
            bl    cube
            mov   result_r, w0           // copy the returned value (in w0) into result
            ldp   x29, x30, [sp], 16
            ret
    cube:   stp   x29, x30, [sp, -16]!
            mov   x29, sp
            mul   w9, w0, w0             // w9 = x*x
            mul   w0, w9, w0             // w0 = w9 * w0 = x*x*x
            ldp   x29, x30, [sp], 16
            ret

Returning Structures
● In C, a function may return a struct by value
● E.g.:
    struct mystruct{
        long int i;
        long int j;
    };
    struct mystruct init(){              // function returns a struct of type “mystruct”
        struct mystruct lvar;            // lvar is local variable
        lvar.i = 0;
        lvar.j = 0;
        return lvar;
    }
    int main(){
        struct mystruct b;
        b = init();
        …
    }
● Usually, a structure is too large to return in x0 or w0
    ○ So, we return it in memory
    ○ Calling code provides memory on the stack to store the returned result
        ■ The address of this memory is put into x8 prior to the function call
        ■ x8 is called the “indirect result location register”.
    ○ The called subroutine writes to memory at this address, using x8 as a pointer for it
● Example:
    mystruct_i = 0
    mystruct_j = 8
    b_size = 16
    alloc = -(16 + b_size) & -16
    dealloc = -alloc
    b_s = 16
    main:   stp   x29, x30, [sp, alloc]!
            mov   x29, sp
            add   x8, x29, b_s           // calculate address of b
            bl    init
            ldp   x29, x30, [sp], dealloc
            ret

    define(lvar_base_r, x9)
    lvar_size = 16
    alloc = -(16 + lvar_size) & -16
    dealloc = -alloc
    lvar_s = 16
    init:   stp   x29, x30, [sp, alloc]!
            mov   x29, sp
            // Calculate lvar struct base address
            add   lvar_base_r, x29, lvar_s
            str   xzr, [lvar_base_r, mystruct_i]    // lvar.i = 0
            str   xzr, [lvar_base_r, mystruct_j]    // lvar.j = 0
            ldr   x10, [lvar_base_r, mystruct_i]
            str   x10, [x8, mystruct_i]             // b.i = lvar.i (x8 was set in main)
            ldr   x10, [lvar_base_r, mystruct_j]
            str   x10, [x8, mystruct_j]             // b.j = lvar.j
            ldp   x29, x30, [sp], dealloc
            ret

Optimising Leaf Subroutines
● Leaf subroutines do not call other subroutines
    ○ If there is no bl in a subroutine, it is a leaf
    ○ Leaf subroutines are leaf nodes on a structure diagram
● A frame record is not pushed onto the stack
    ○ Since the routine does not do a bl, the LR won’t change
    ○ Since the routine does not call a subroutine, the FP won’t change
    ○ Thus we can eliminate the usual stp/ldp instructions
● If one only uses the registers x0-x7 and x9-x15, a stack frame is not pushed at all
    ○ No need to save/restore registers
● E.g. optimised cube function
    cube:   mul   w9, w0, w0             // w9 = x^2
            mul   w0, w9, w0             // w0 = x^3
            ret

Subroutines with > 8 Arguments
● Arguments beyond the 8th are passed on the stack
● The calling code allocates memory at the top of the stack, then writes the “spilled” arguments there
    ○ By convention, each argument is allocated 8 bytes no matter its size.
        ■ Similar to how registers are always 8 bytes even if used for ints
    ○ Callee reads this memory using the appropriate offset
● E.g. C code:
    int main()
    {
        register int result;
        result = sum(10, 20, 30, 40, 50, 60, 70, 80, 90, 100);
    }
    int sum(int a1, int a2, int a3, int a4, int a5, int a6, int a7, int a8, int a9, int a10)
    {
        return a1+a2+a3+a4+a5+a6+a7+a8+a9+a10;
    }
● Assembly code:
    define(result_r, w19)
    spilled_mem_size = 16
    alloc = -spilled_mem_size & -16
    dealloc = -alloc
            .global main
    main:   stp   x29, x30, [sp, -16]!
            mov   x29, sp
            mov   w0, 10
            mov   w1, 20
            mov   w2, 30
            mov   w3, 40
            mov   w4, 50
            mov   w5, 60
            mov   w6, 70
            mov   w7, 80
            // Allocate memory for args 9 and 10
            add   sp, sp, alloc
            // Write spilled arguments to the top of the stack
            mov   w9, 90
            str   w9, [sp, 0]
            mov   w9, 100
            str   w9, [sp, 8]
            // Call sum function
            bl    sum
            mov   result_r, w0
            // Deallocate memory for spilled args
            add   sp, sp, dealloc
            ldp   x29, x30, [sp], 16     // pop the 16-byte frame record
            ret

    arg9_s = 16                          // starts at 16 because of the callee's frame record
    arg10_s = 24
    sum:    stp   x29, x30, [sp, -16]!
            mov   x29, sp
            add   w0, w0, w1
            add   w0, w0, w2
            add   w0, w0, w3
            add   w0, w0, w4
            add   w0, w0, w5
            add   w0, w0, w6
            add   w0, w0, w7
            // Read the 9th and 10th args
            ldr   w9, [sp, arg9_s]
            add   w0, w0, w9
            ldr   w9, [sp, arg10_s]
            add   w0, w0, w9
            ldp   x29, x30, [sp], 16
            ret

Variables
Local and static local
● In C, local (automatic) variables are always allocated in the stack frame for a function
    ○ Scope: local to the block of code where declared
    ○ Lifetime: life of the block of code
● Static local variables are different
    ○ Scope: block of code where declared
    ○ Lifetime: life of the program
        ■ persist from call to call of the function
● E.g. C code:
    int count(){
        static int value = 0;
        return ++value;
    }
● Cannot be stored on the stack frame for the function
● Stored in a separate section of RAM
Global
● Scope: global (from declaration onwards)
● E.g.:
    int val;                             // ← global variable, not inside any function
    int main(){
        val = 3;
        …
    }
    int f(){
        int a;
        a = val;
    }
● Stored in a separate section of memory
Static global
● Scope: local to file (from declaration onwards)
● Lifetime: life of program
● Stored in a separate section of RAM

text, data, and bss Sections
● Programs may allocate 3 sections of memory:
    ○ text
        ■ Contains:
            Program text (machine code)
            Read-only, programmer-initialised data
        ■ Is read-only memory
            Attempts to write to this memory cause a segmentation fault
    ○ data
        ■ Read/write memory
    ○ bss
        ■ block starting symbol
        ■ Contains zero-initialised data (e.g.
          i = 0)
        ■ Is read/write memory
● These sections are located in low memory, just after the section reserved for the OS kernel.
    Low
        OS
        text            ← PC
        data
        bss
        heap
          v
        free memory     ← SP
        stack
          ^             ← FP
    High
● Pseudo-ops indicate that what follows goes in a particular section of memory
    ○ .text
        ■ Default section when assembling
    ○ .data
    ○ .bss
● You can alternate between them
● Assembler uses a location counter for each section
    ○ Starts at 0, increases as instructions and data are processed
    ○ Final step of assembly gathers all code and data into appropriate sections
● When the OS loads the program into RAM:
    ○ text, data loaded first
    ○ bss zeroed

External variables
● Non-local variables, allocated in data or bss
    ○ Used to implement global and static local variables as external variables
● Can be allocated and initialised using the pseudo-ops:
    ○ .dword
    ○ .word
    ○ .hword
    ○ .byte
● General form:
    ○ label:  pseudo-op  value[, value2, …]
● E.g.:
            .data
    a_m:    .hword  23                   // _m for things in memory (data or bss)
    b_m:    .word   (11*4) - 2
    c_m:    .dword  0
    arr_m:  .byte   10, 20, 30
● Allocates 17 bytes in the data section (2 + 4 + 8 + 3)
● The labels represent 64-bit addresses
    ○ Use adrp and add to put the address into a register
    ○ Then use ldr or str to access the variable
    ○ E.g. C code:
        int i = 2, j = 12, k = 0;
        int main()
        {
            k = i + j;
            …
        }
    ○ Assembly code:
            .data
    i_m:    .word   2                    // i (var in C code) _m (using memory)
    j_m:    .word   12
    k_m:    .word   0
            .text
            .balign 4
            .global main
    main:   stp   x29, x30, [sp, -16]!
            mov   x29, sp
            adrp  x19, i_m
            add   x19, x19, :lo12:i_m
            ldr   w20, [x19]             // w20 = i
            adrp  x19, j_m
            add   x19, x19, :lo12:j_m
            ldr   w21, [x19]             // w21 = j
            add   w22, w20, w21          // w22 = i+j
            adrp  x19, k_m
            add   x19, x19, :lo12:k_m
            str   w22, [x19]             // k = w22

.skip
● Uninitialized space can be allocated with the .skip pseudo-op
    ○ Eg: 10 element int array
        myarray:  .skip 10*4
● Use .global if the variable is to be made available to other compilation units
    ○ Eg: .global myvar_m
● The bss section usually only uses the .skip pseudo-op
    ○ All bss memory is zeroed before the program executes
        ■ Initializing memory to non-zero values (.word, .hword, etc.) doesn’t make sense
    ○ Eg:
                .bss
    array_m:    .skip   10*4             // int array
    c_m:        .skip   1                // char
    h_m:        .skip   2                // short int (halfword)
● Programmer-initialised constants are put into the .text section
    ○ Before or between functions; not in the body
    ○ Eg:
            .text
            .balign 4
    func:   stp   …
            …
            ret
    con_m:  .hword 42                    // cannot be overwritten; constant_m
            .balign 4
    func2:  stp   …
            …
            ret

ASCII Character Set
● American Standard Code for Information Interchange
    ○ Encodes characters using 7 bits, stored in a byte
    ○ www.lookuptables.com
    ○ A:       0x41
    ○ Z:       0x5A
    ○ a:       0x61
    ○ z:       0x7A
    ○ 0:       0x30
    ○ 9:       0x39
    ○ Space:   0x20
    ○ Tab:     0x9
    ○ Period:  0x2e
● In assembly, character constants can be denoted with:
    ○ hex code
        ■ Eg: mov   w19, 0x5A            // Z
    ○ the character in single quotes
        ■ Eg: mov   w19, 'Z'             // stores 0x5A
        ■ Note: may interfere with m4
    ○ In gdb, use p/c on a w register (e.g. p/c $w19) to print it as a character

Creating and addressing string literals
● A string is an array of characters
● Could be initialised in memory one byte at a time
    ○ e.g. "cheers"
        ■ .byte   'c', 'h', 'e', 'e', 'r', 's'
● Or use the .ascii pseudo-op
        ■ .ascii  "cheers"
● In C, strings are null terminated
    ○ i.e.
      c h e e r s \0
    ○ Could be done using two pseudo-ops:
            .ascii  "cheers"
            .byte   0
    ○ Or more conveniently with .asciz or .string:
            .asciz  "cheers"
● A string literal is a read-only array of characters, allocated in the text section
    ○ In C code, delimited with ""
    ○ Eg:
        int main()
        {
            printf("Hello, world!\n");
        }
    ○ The literal usually has a label, which represents the address of the first character in the array
            .text
    fmt:    .string "Hello, world!\n"    // string literal
            .balign 4
            .global main
    main:   stp   x29, x30, [sp, -16]!
            mov   x29, sp
            adrp  x0, fmt                // address of first char 'H'
            add   x0, x0, :lo12:fmt      // now in x0
            bl    printf
< Notes missing here>

External Array of Pointers
● Created with a list of labels
● E.g. C code:
    #include <stdio.h>
    // Array of pointers to string literals
    char *season[] = {"spring", "summer", "fall", "winter"};
    int main(){
        register int i;
        for (i = 0; i < 4; i++){
            printf("season[%d] = %s\n", i, season[i]);
        }
        return 0;
    }
● Equivalent assembly code
    define(i_r, w19)
    define(base_r, x20)
            .text
    fmt:    .string "season[%d] = %s\n"
    spr_m:  .string "spring"
    sum_m:  .string "summer"
    fal_m:  .string "fall"
    win_m:  .string "winter"
            .data                        // create array of pointers in data section
            .balign 8                    // must be double-word aligned;
                                         // not strictly necessary in armv8
    season_m:   .dword  spr_m, sum_m, fal_m, win_m
            .text
            .balign 4
            .global main
    main:   stp   x29, x30, [sp, -16]!
            mov   x29, sp
            mov   i_r, 0
            b     test
    top:    adrp  x0, fmt
            add   x0, x0, :lo12:fmt               // set up 1st arg
            mov   w1, i_r                         // set up 2nd arg
            adrp  base_r, season_m                // calc array base address
            add   base_r, base_r, :lo12:season_m
            ldr   x2, [base_r, i_r, sxtw 3]       // set up 3rd arg
            bl    printf
            add   i_r, i_r, 1
    test:   cmp   i_r, 4
            b.lt  top
            ldp   x29, x30, [sp], 16
            ret

Command Line Arguments
● Allow you to pass values from the shell into your program
● In C: main(int argc, char *argv[])
    ○ argc: argument count (number of arguments)
    ○ argv: argument vector (array of pointers to the arguments)
● Eg C code: myecho.c
    int main(int argc, char *argv[]){
        register int i;
        for (i = 0; i < argc; i++){
            printf("%s\n", argv[i]);
        }
        return 0;
    }
● Sample run
    prompt> ./myecho one two
    ./myecho
    one
    two
● Assembly code
    // Note: argc is in w0 and argv is in x1 for main()
    define(i_r, w19)
    define(argc_r, w20)
    define(argv_r, x21)
    fmt:    .string "%s\n"
            .balign 4
            .global main
    main:   stp   x29, x30, [sp, -16]!
            mov   x29, sp
            mov   argc_r, w0                      // copy argc to safe register
            mov   argv_r, x1                      // copy argv to safe register
            mov   i_r, 0
            b     test
    top:    adrp  x0, fmt
            add   x0, x0, :lo12:fmt               // set up 1st arg
            ldr   x1, [argv_r, i_r, sxtw 3]       // set up 2nd arg
            bl    printf
            add   i_r, i_r, 1
    test:   cmp   i_r, argc_r
            b.lt  top
            ldp   x29, x30, [sp], 16
            ret

Separate Compilation and Linking
● Using two different assembly code files
● Assembly code: first.s
            .balign 4
            .global main
    main:   stp   x29, x30, [sp, -16]!
            mov   x29, sp
            adrp  x19, a_m
            add   x19, x19, :lo12:a_m             // reference a_m from second.s
            ldr   w0, [x19]
            bl    myfunc                          // reference myfunc from second.s
            ldp   x29, x30, [sp], 16
            ret
● Assembly code: second.s
            .global a_m
    a_m:    .word   44                            // global constant
            .balign 4
            .global myfunc                        // global function
    myfunc: stp   x29, x30, [sp, -16]!
            mov   x29, sp
            sub   w0, w0, 1
            ldp   x29, x30, [sp], 16
            ret
● To assemble and link:
    as first.s -o first.o                // invoke the assembler
    as second.s -o second.o
    gcc first.o second.o -o myexec
● To view in gdb:
    (gdb) x/8i main
    (gdb) x/wd a_m                       // w: read word, d: show as decimal
    (gdb) x/5i myfunc
● Source code is often divided into several modules
    ○ i.e. separate .c or .s files
    ○ Makes development of large projects easier
● Modules can be compiled into relocatable object code
    ○ Will be put into corresponding .o files
    ○ Eg: gcc -c myfile1.c               // produces myfile1.o
● Object code is linked together to create an executable
    ○ Is done by ld (the linker), usually invoked by gcc as the final step in the compilation process
    ○ Eg: gcc myfile.o myfile2.o -o myexec
    ○ ld resolves any references to external data or functions in other modules
    ○ Code and data are relocated in memory as necessary to form contiguous text, data, and bss sections
● …
● Assembling and linking by hand (the as and gcc commands shown earlier for first.s and second.s) can be done more easily using a makefile
● C code can call functions written in Assembly
    ○ For optimization of sections of code
● Eg:
  C code: mymain.c
    #include <stdio.h>
    int sum(int, int);
    int main(){
        int i = 5, j = 10, result;
        result = sum(i,j);               // defined in sum.s
        printf("result = %d\n", result);
        return 0;
    }
  Asm code: sum.s
            .balign 4
            .global sum
    sum:    stp   x29, x30, [sp, -16]!
            mov   x29, sp
            add   w0, w0, w1
            ldp   x29, x30, [sp], 16
            ret
  To compile and link:
    gcc -c mymain.c                      // produces mymain.o in same directory
    as sum.s -o sum.o                    // can be before first gcc instr
    gcc mymain.o sum.o -o myprog         // links the final executable
    ./myprog

Input and Output
● A computer may do input and output in several ways
    ○ Interrupt-driven I/O
        ■ Will be covered in CPSC 359
    ○ Memory-mapped I/O
    ○ Port I/O
    ○ System I/O
● Our ARM servers are running a Linux OS, therefore only system I/O is available
    ○ User-level programs communicate with external devices through the OS
        ■ Prevents malicious or accidental interactions that may damage the device
    ○ eg:
            mov   x8, 57                 // request code goes in x8 (57 is the close request)
            svc   0                      // call system function
● Arguments to system calls are put in x0 - x5
● Any return value is in x0
● In Unix (Linux), all peripheral devices are represented as files
    ○ present a uniform interface for I/O
● Typical pattern:
    ○ Open file
        ■ i.e. connect to device, get its file descriptor
    ○ Read from or write to file
        ■ i.e. do device I/O, transferring bytes
    ○ Close the file
● File I/O involves interacting with secondary memory
● Standard I/O involves mouse, keyboard, display devices
● Opening a file:
    ○ C code: int fd = openat(int dirfd, const char *pathname, int flags, mode_t mode);
        ■ dirfd: directory file descriptor (used if path is relative)
            Can be set to AT_FDCWD (value -100), to indicate that the pathname is relative to the program’s current working directory
        ■ pathname: relative or absolute pathname to a file
        ■ flags: combination of constants indicating what will be done to the file’s data
            O_RDONLY    00       Read-only access
            O_WRONLY    01       Write-only access
            O_RDWR      02       Read/write access
            Optional:
            O_CREAT     0100     Create file if it does not exist
            O_EXCL      0200     Fail if file exists (with O_CREAT)
            O_TRUNC     01000    Truncate an existing file
            O_APPEND    02000    Append access
        ■ mode: optional argument that specifies UNIX file permissions
            Required when creating a new file with specific permissions
            Specified in octal
            e.g. 0700 specifies read/write/exec permission for file owner, none for group or others
        ■ fd: the returned file descriptor
            Is -1 on error
    ○ e.g. Assembly code:
    pn:     .string "myfile.bin"
            …
            mov   w0, -100               // first arg: use relative to cwd
            adrp  x1, pn                 // second arg: pathname
            add   x1, x1, :lo12:pn
            mov   w2, 0                  // third arg: readonly
            mov   w3, 0                  // fourth arg: not used
            mov   x8, 56                 // openat I/O request
            svc   0                      // call system function
            cmp   w0, 0                  // error check
            b.ge  open_ok
            …
…
● Assembly code:
    buf_size = 32
    alloc = -(16 + buf_size) & -16
    dealloc = -alloc
    buf_s = 16
    main:   stp   x29, x30, [sp, alloc]!
            mov   x29, sp
            add   x1, x29, buf_s         // 2nd arg (ptr to buf)
    top:    mov   w0, 0                  // 1st arg (stdin)
            mov   x2, buf_size           // 3rd arg (BUFSIZE)
            mov   x8, 63                 // read I/O request
            svc   0                      // call sys function
            cmp   x0, 0                  // compare n_read and 0
            b.le  exit                   // exit loop if n_read is

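The open / read / close pattern described in the Input and Output notes can also be sketched in C, which may help when checking the assembly against the system calls it makes. This is only a minimal sketch under stated assumptions: the helper names write_bytes and count_bytes and the constant BUF_SIZE are hypothetical (they do not come from the lecture notes), and the C library wrappers openat, read, write, and close stand in for the raw svc requests.

```c
#define _POSIX_C_SOURCE 200809L   /* exposes openat() and AT_FDCWD */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BUF_SIZE 32               /* assumed: same buffer size as the assembly example */

/* Hypothetical helper: create (or truncate) a file and write n bytes to it.
   Returns the number of bytes written, or -1 on error. */
long write_bytes(const char *pathname, const void *data, size_t n)
{
    assert(pathname != NULL && data != NULL);

    /* Open: the path is relative to the cwd; mode 0700 is needed because of O_CREAT */
    int fd = openat(AT_FDCWD, pathname, O_WRONLY | O_CREAT | O_TRUNC, 0700);
    if (fd < 0)
        return -1;

    ssize_t n_written = write(fd, data, n);   /* transfer the bytes */
    close(fd);                                /* close the file */
    return (long)n_written;
}

/* Hypothetical helper: open a file read-only and count its bytes by
   reading BUF_SIZE bytes at a time. Returns -1 on error. */
long count_bytes(const char *pathname)
{
    char buf[BUF_SIZE];
    long total = 0;
    ssize_t n_read;

    assert(pathname != NULL);

    int fd = openat(AT_FDCWD, pathname, O_RDONLY);
    if (fd < 0)
        return -1;

    /* read() returns 0 at end of file and a negative value on error */
    while ((n_read = read(fd, buf, BUF_SIZE)) > 0)
        total += n_read;

    close(fd);
    return n_read < 0 ? -1 : total;
}
```

For instance, count_bytes("myfile.bin") would go through the same three steps as the assembly: an openat request with AT_FDCWD, repeated read requests into a 32-byte buffer until nothing more is returned, and a final close request.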