CPSC 355 Lecture Notes PDF
Summary
These lecture notes cover computer architecture, including CPU architecture, the fetch-execute cycle, data representation, assembly language, C programming, and different CPU architectures (Accumulator, Load/Store, RISC, CISC).
CPSC 355 Lecture Notes
GOODBYE MANZARA ❣️ ❣️ WE WILL ALWAYS REMEMBER YOU

Course objectives:
Computer structure, i.e. architecture, specifically CPU architecture
Learn how a computer operates
  ○ fetch-execute cycle
Learn how data and instructions are represented internally
  ○ Signed and unsigned ints
  ○ Strings, chars
  ○ Floats
  ○ Machine instructions
Assembly
  ○ Used for:
      Embedded systems
      OS kernels
      Device drivers
      Code generator as part of compiler
      (rarely) Applications
  ○ Helps to understand:
      Computer architecture
      Operating system
      Writing efficient programs
      Connection between high level languages and machine operation
C

High-Level Architecture
Computer system (most basic) consists of:
  CPU
  System clock
  Primary memory (RAM)
  Secondary memory (SSD, HDD, etc.)
  Peripheral input and output devices
  Bus
[Diagram: data transfer via the bus. The CPU (with the clock attached), primary memory, secondary memory, input devices and output devices are all connected to a shared bus.]

CPU
Executes instructions
Controls transfer of data across the bus
Usually contained on a single microprocessor chip
  ○ Intel Core i5
  ○ Applied Micro APM883208-X1
  ○ Apple A7 (ARM technology, branded with the Apple name)
Three main parts
  ○ Control Unit (CU)
      Directs execution of instructions
      Loads an opcode from primary memory into the Instruction Register (IR)
      Decodes the opcode to identify the operation
      If necessary, transfers data between primary memory and registers
      If necessary, directs the ALU to operate on data in registers
  ○ Arithmetic Logic Unit (ALU)
      Performs arithmetic and logical operations on data in registers
      e.g. Add numbers from 2 source registers, then store the result in a third register.
      e.g. Bitwise logic, like AND between two source registers.
  ○ Registers
      Binary storage units in the CPU
      May contain:
        ○ Data
        ○ Addresses/pointers
        ○ Instructions
        ○ Status information
      General-purpose registers are used by a programmer to temporarily hold data and addresses.
      Lose information when power is turned off.
      The Program Counter (PC) contains the address in memory of the currently executing instruction. To progress to the next instruction, it is simply incremented.
      The Status Register (SR) contains information (flags, which are bits) about the result of a previous instruction
        e.g. overflow, carry

System Clock
Generates a clock signal to synchronise the CPU and other clocked devices
  ○ Is a square wave at a particular frequency
      Shaped like a wave, but with only 90 degree angles
  ○ Clock rate (speed) is given in GHz
      1 GHz = one billion cycles per second

Primary Memory
Random access memory (RAM).
  A better term would be "direct access memory" or "non-sequential access memory". The word "random" is a historical term that does not really make sense.
Any byte in memory can be accessed directly with its address
Volatile: data disappears when power is lost
Stores:
  ○ Program instructions
  ○ Program data (variables)
Consists of a sequence of addressable 1-byte memory locations.
  ○ There are architectures with more than 1 byte per location, but these are rare
Architectures:
  ○ von Neumann architecture: RAM contains both data and programs (instructions).
      We will be using this on our ARM servers
  ○ Harvard architecture: uses separate memories for data and programs.

Bus
Copper wires on the motherboard
Set of parallel data/signal lines
Transfers information between computer components
Often subdivided into:
  ○ address bus
      CPU to RAM
      Specifies a memory location in RAM (a positive integer)
      Or sometimes, a memory-mapped I/O device.
      Common sizes: 32 or 64 bits
  ○ data bus
      Communication between CPU and RAM
      Common sizes: 32 or 64 bits
  ○ control bus
      Communication between CPU and RAM
      Used to control or monitor devices connected to the bus
      E.g.
read/write signal for RAM
      Handled at the level of hardware - we do not control this, and will not spend much time on it
  ○ An expansion bus might be connected to the computer's local bus
      For attaching input and output devices
      Bus standards:
        USB
        SCSI ("scuzzy", older standard)
        PCIe
[Diagram: the CPU drives the address bus towards primary memory.]

Secondary Memory
Holds a computer's file system
Non-volatile memory
  ○ Persists throughout a power cycle
Usually embodied on an HDD or SSD

Peripheral I/O Devices
Allow communication between the computer and the external environment
Input examples:
  ○ Microphone (a transducer; uses an ADC (analog-to-digital converter) to convert sound waves into computer-readable information)
  ○ Mouse, keyboard, trackball, joystick, etc.
Output:
  ○ Monitor, printer, speakers, etc.
I/O devices:
  ○ HDD
      Need to learn system I/O
  ○ Modem
  ○ Connections to networks

Basic CPU Architectures
Accumulator Machines
[Diagram: inside the CPU, the ALU takes its inputs from the accumulator (ACC) and the data bus and writes its output back to the ACC. The CPU sends addresses to primary memory via the address bus and exchanges data with it via the data bus.]
Operands for an instruction come from the accumulator register (ACC) and from a single location in RAM
ALU results are stored in the ACC, overwriting the old contents
The ACC can be loaded from or stored to RAM

Load/Store Machines
[Diagram: inside the CPU, the ALU takes two inputs from the register file and writes its output back to the register file.]
Note: The "register file" is not a file, but a set of registers. Could be thousands.
The register file can be saved to or loaded from RAM.
Only load/store operations can access RAM
Other instructions operate on specified registers in the register file, not on RAM
  ○ Physical distance to registers is much smaller than to RAM
Typical program sequence:
  ○ Load registers from memory (Load instructions)
  ○ Execute an instruction using two source registers, putting the result into a destination register.
  ○ Store the result into memory (Store instruction)

RISC and CISC Architectures
RISC: Reduced Instruction Set Computer
Uses only simple instructions that can be executed in one machine cycle.
Advantage: Enables faster clock rates, thus faster overall execution
Disadvantage: Without access to complex instructions, programs require more instructions and become larger
  Ex. The original SPARC architecture had no multiply instruction.
  ○ Multiplication was done using repetitive add-shift operations.
Machine instructions are always the same size
  ○ Makes decoding simpler and faster
  ○ ARMv8 instructions are always 32 bits wide
CISC: Complex Instruction Set Computer
May have instructions that take many cycles to execute
  ○ Slows down the clock rate but makes programming easier
  e.g. Intel Core 2
  ○ add: 1 cycle
  ○ subtract: 1 cycle
  ○ multiply: 5 cycles
  ○ division: 40 cycles
Machine instructions can vary in length, and may be followed by "immediate" data.
  ○ Makes decoding difficult and slow
  ○ ex: Intel x86 instructions can be 1-15 bytes long

Instruction Cycle
Fetch-execute or fetch-decode-execute cycle
The CPU executes each instruction in a series of small steps:
  ○ Fetch the next instruction from memory into the instruction register (IR)
      The program counter register (PC) stores the address
  ○ Increment the contents of the PC to point to the next instruction
  ○ Decode the instruction
  ○ If the instruction uses an operand in RAM, calculate the address
  ○ Fetch the operand
  ○ Repeat the previous two steps if necessary
  ○ Execute the instruction
  ○ Calculate the address in RAM to store the result in, if necessary.
  ○ Store the result

Assembly Language Programs
Consist of a series of statements, each of which corresponds to a machine instruction.
ARMv8 example: add x20, x20, x21 corresponds to:
  ○ 1000 1011 0001 0101 0000 0010 1001 0100
  ○ 0x8b150294
Each statement consists of an opcode and a variable number of operands
  ○ "add x20, x20, x21" → "opcode operand,operand,operand"
Instructions are stored sequentially in RAM, therefore each instruction has a unique address.
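The mapping from the statement add x20, x20, x21 to the word 0x8b150294 can be reproduced by packing the register numbers into bit fields. The following is a minimal C sketch, assuming the A64 "ADD (shifted register)" layout with no shift applied; the helper name encode_add_x is hypothetical and is not part of the course material.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: builds the 32-bit machine word for the 64-bit
 * instruction "add Xd, Xn, Xm" with no shift.
 * Assumed field layout: fixed bits 0x8B000000 (64-bit ADD, shift = LSL #0),
 * Rm in bits 20:16, Rn in bits 9:5, Rd in bits 4:0. */
uint32_t encode_add_x(unsigned rd, unsigned rn, unsigned rm)
{
    /* Register numbers must fit in their 5-bit fields. */
    assert(rd < 32 && rn < 32 && rm < 32);

    const uint32_t base = 0x8B000000u;
    return base | ((uint32_t)rm << 16) | ((uint32_t)rn << 5) | (uint32_t)rd;
}
```

Calling encode_add_x(20, 20, 21) yields 0x8b150294, the same word listed in the notes, which shows that the three operands occupy fixed 5-bit fields of the instruction.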
Optionally, a label can prefix any statement.
  ○ Form: "label: statement"
  ○ A label is a symbol whose value is the address of the machine instruction.
  ○ May be used as a target for a branch instruction
Pseudo-ops (assembler directives) do not generate machine instructions, but give the assembler extra information.
  ○ Form: a period followed by the directive name
  ○ e.g. ".global start"
Comments may follow a statement, for example after a // delimiter.
Use indents to format label, opcode, operands, comments
  Use an editor like emacs to automate this

Assemblers
Translate assembly source code into machine code
In this course, we are using the GNU assembler, "as".
  ○ Part of the GNU gcc compiler suite
To assemble source code:
  ○ gcc myprog.s -o myprog
gcc assumes that any file that ends in ".s" contains assembly language source code.

Macro Preprocessors
Many assemblers support macros
  ○ Allow you to define a piece of text with a macro name
  ○ Parameters can optionally be specified
  ○ Macros: little bits of text that expand into larger commands
      The text is substituted inline wherever the macro is invoked; this is called macro expansion
Improves readability
Unfortunately gcc has limited support for macros
  ○ Therefore, we will use another macro preprocessor called m4 before invoking gcc
      m4 is a standard Unix command
m4 e.g.
    define(coef, x23)     // defines 2 macros, coef and z_r,
    define(z_r, x18)      // which act as names for registers x23 and x18
    // "z_r" means z is stored in a register instead of memory (conventional name)
    add x19, z_r, coef    // store the sum of z_r and coef in x19
    // expands to:
    add x19, x18, x23     // x19 is already the name of a register, so it is not touched

Macro preprocessors (cont.)
General procedure:
  ○ Put your source code containing macros into a file ending in .asm
  ○ Invoke m4, redirecting output to a file ending in .s
  ○ e.g. m4 myprog.asm > myprog.s
  ○ Run gcc on the output file

Readings and Exercises:
P & H: Chapter 1
Most of the text we're dealing with is Chapters 2-3, especially 2.7.
ARMv8 Architecture

Introduction
This course uses the Gigabyte R152-P33 servers (two of them)
The CPU is the Ampere Altra Q64-22 64-bit processor
  ○ It is an implementation of the ARMv8-A specification, licensed from ARM Holdings, PLC.
  ○ In the name, -A: application. -M: microcontroller.
  ○ ARM: Advanced RISC Machine (originally Acorn RISC Machine)
Installed OS is Linux Fedora 40.
  ○ Includes up-to-date versions of gcc, as, gdb, and m4
Access via ssh with the addresses:
  ○ csa1.uc.ucalgary.ca (computer science arm 1)
  ○ csa2.uc.ucalgary.ca
  ○ Load balancer: csarm.ucalgary.ca
      Puts you on whichever is less active

Architecture
The ARMv8-A architecture:
  ○ is a RISC
  ○ is a Load/Store machine
      The register file contains 31 64-bit-wide registers
      Most instructions manipulate 64-bit data stored in these registers.
      When we can get away with just 32-bit data, we only use the low-order (right) half of the registers.
  ○ uses the von Neumann architecture for RAM
  ○ has two execution states:
      AArch64: Uses the instruction set called A64 with 64-bit registers. Used exclusively in this course
      AArch32: Runs code developed on older versions of the ARM architecture. Uses the A32 or T32 instruction sets.
        ○ Provides backwards compatibility with the older 32-bit ARM and THUMB instruction sets
  ○ has four exception levels:
      EL0: Application
        Normal user applications with limited privileges
        Restricted access to a limited set of instructions and registers, as well as certain parts of memory
          ○ Prevents interfering with the OS
        Most common
      EL1: OS Kernel
        For the OS kernel
          ○ Privileged access to instructions, registers, and memory.
          ○ Accessed indirectly by user programs using system calls
      EL2: Hypervisor
        For a hypervisor:
          ○ Sits above the "supervisor" level, EL1
          ○ Supports virtualization, allowing the computer to host multiple guest operating systems on their own virtual machines.
      EL3: Secure Monitor
        Low level firmware
          ○ Includes the Secure Monitor

Registers
AArch64 has 31 64-bit-wide general purpose registers
  ○ Numbered 0-30
  ○ When using all 64 bits, use "x" or "X" before the number. Stands for "extended"
      e.g. x0, x1, …, x30
  ○ When using the low-order 32 bits of the register, use "w" or "W". Stands for "word"
      e.g. w2, w29
Many registers have special uses:
  ○ x0-x7: Used to pass arguments into a procedure, then return results.
  ○ x8: Indirect result location register
  ○ x9-x15: Temporary registers
  ○ x16-x17: Intra-procedure-call temporary registers
      In documentation, often called IP0 and IP1.
  ○ x18: Platform register (for some operating systems)
  ○ x19-x28: Free registers (use for the work we are doing)
  ○ x29: Frame pointer (FP) register
  ○ x30: Procedure link register (LR)
x19-x28 are callee-saved registers, which means they are safe to use.
  ○ Their values are preserved by any function you call.
Special purpose registers:
  ○ Stack Pointer:
      SP in code: 64-bit-wide register used in A64 code.
      WSP: the low-order 32 bits of SP; ignore for now
      Points at the top of the run-time stack
  ○ Zero Register:
      XZR: 64 bits wide
      WZR: 32 bits wide
      Gives the value 0 when read from
      Discards the value when written to
  ○ Program Counter
      PC: 64 bits wide
      Holds the address of the currently executing instruction
      Can't be accessed directly as a named register
      Is changed indirectly by branch instructions
      Is used implicitly by pc-relative load/store instructions
      Can be accessed in gdb with $pc
  ○ Floating Point Registers
      There are 32 128-bit-wide floating-point registers.
      Discussed at the end of the course
  ○ System registers (about 600)
      Only accessible at EL1 (OS kernel)
      Used in OS kernel code

A64 Assembly Language
Consists of statements with one opcode and 0 to 4 operands.
  ○ The allowed operands depend on the particular instruction
  ○ In general:
      First operand: destination register
      The remaining operands: source registers
      E.g.
      add x19, x20, x21
      x19 is the destination, the others are sources 1 and 2
An immediate value (a constant) may be used as the final source operand for some instructions.
  ○ E.g. add x19, x20, 42
  ○ A # symbol can prefix the immediate value, but is optional when using gcc.
  ○ The constant's range depends on the instruction, as the machine instruction determines the number of available bits.
  ○ Immediates are assumed to be decimal numbers unless prefixed:
      Hexadecimal numbers: prefix 0x
      Octal numbers: prefix 0, e.g. 0777
      Binary numbers: prefix 0b, e.g. 0b101
Some instructions are aliases of other instructions
  ○ e.g. mov x29, sp (copy sp into x29) is an alias for add x29, sp, 0
  ○ i.e. x29 = sp + 0
Commonly used instructions
  ○ Note: Move instructions are more like copy instructions, because they do not affect the source register.
  ○ Note: W is the low-order half of the 64-bit register, X is all 64 bits
  ○ Move immediate (32-bit)
      Form: mov Wd, #imm32
      Wd: general purpose destination register
      #imm32: immediate 32-bit value
        ○ -2^31 to 2^32-1
      e.g. mov w20, -237
  ○ Move immediate (64-bit)
      Form: mov Xd, #imm64
      Xd: destination register
      #imm64: 64-bit immediate value, -2^63 to 2^64-1
      e.g. mov x21, 0xFFFE
  ○ Move register (32-bit)
      Form: mov Wd, Wm
      Destination, source registers
      Alias for orr Wd, wzr, Wm
  ○ Move register (64-bit)
      e.g. mov x22, x20
      Note that you cannot mix 32-bit (W) and 64-bit (X) registers in the same instruction.
  ○ Branch and Link: bl function
      The function can be a library function or your own, e.g. printf
      e.g. bl label
      e.g. bl printf
      Arguments are put into x0-x7 before the call
      The return value is in x0

Basic Program Structure
The main routine:
          .global main
    main: stp x29, x30, [sp, -16]!
          mov x29, sp               // the two instructions above save the state of the calling code
          ...
          ldp x29, x30, [sp], 16    // restores the state
          ret
Memorize this!
.global main makes the label "main" visible to the linker
The main() routine is where execution always starts
…[sp, -16]!
  Allocates 16 bytes in stack memory (in RAM)
  Does so by pre-incrementing the SP register by -16
stp x29, x30, …
  Stores the contents of the pair of registers to the stack
    ○ x29: frame pointer (FP)
    ○ x30: link register (LR)
    ○ SP points to the location in RAM where we write to
  Saves the state of the registers used by the calling code
  x29 and x30 are both 8 bytes long, hence allocating 16 bytes of memory
mov x29, sp
  Updates FP to the current SP
  FP may be used as a base address in the routine
ldp x29, x30, …
  Loads the pair of registers from RAM
    ○ SP points to the location in RAM where we read from
  Restores the state of the FP and LR registers
…[sp], 16
  Deallocates 16 bytes of memory
  Does so by post-incrementing SP by +16
ret
  Returns control to the calling code (in the OS)
  Uses the address in LR

Basic Arithmetic Instructions
Addition:
  Uses 1 destination and 2 source operands
  Register (64 or 32 bit)
    ○ e.g. add x19, x20, x21 // x19 = x20 + x21
  Immediate (64 or 32 bit)
    ○ e.g. add w20, w20, 2 // w20 = w20 + 2
Subtraction:
  Uses 1 destination and 2 source operands
  Register (32 or 64 bit):
    ○ sub x0, x1, x2 // x0 = x1 - x2
  Immediate (...)
Multiplication:
  1 destination and 2 or 3 source registers
  No immediates allowed
  Form (32-bit): mul Wd, Wn, Wm // Wd = Wn * Wm
    Alias for: madd Wd, Wn, Wm, wzr
    ○ wzr → zero register
  Form (64-bit):
    ○ e.g.
      mul x19, x20, x20 // square number
Multiply-Add
  Form (32-bit):
    ○ madd Wd, Wn, Wm, Wa
        Wd = Wa + (Wn * Wm)
    ○ e.g. madd w20, w21, w22, w23
        w20 = w23 + (w21 * w22)
  Form (64-bit):
    ○ e.g. madd x20, x0, x1, x20
        x20 = x20 + (x0 * x1)
Multiply-Subtract
  Form (32-bit):
    ○ msub Wd, Wn, Wm, Wa
        Wd = Wa - (Wn * Wm)
Multiply-Negate
  Form (32-bit):
    ○ mneg Wd, Wn, Wm
        Wd = -(Wn * Wm)
Division
  No immediates allowed
  Signed form (32 or 64 bit):
    ○ sdiv Wd, Wn, Wm
        Signed divide: the operands are signed integers
        Wd = Wn / Wm
  Unsigned form (32 or 64 bit)
    ○ udiv
  These instructions do integer division; the remainder is discarded
    ○ The remainder (modulus) can be calculated using numerator - (quotient * denominator)
        msub destination, quotient, denominator, numerator
  Dividing by 0 does not generate an exception
    ○ Instead, 0 is written to the destination register
Other variants exist and can be found in the ARM documentation on D2L

Printing to Standard Output
Call printf
  ○ Standard function in the C library
  ○ Invoked with ≥ 1 arguments
      The first is the format string (usually a string literal)
      The rest correspond to the placeholders in the string
  ○ Example C code:
        …
        int x = 42;
        printf("Meaning of life = %d\n", x);
Equivalent assembly code:
    fmt:  .string "Meaning of life = %d\n"   // creates the format string
          .balign 4                          // ensures instructions are aligned in memory
          .global main
    main: …
          adrp x0, fmt
          add x0, x0, :lo12:fmt              // Arg 1: address of the string
          mov w1, 42                         // Arg 2: value for %d
          bl printf                          // function call
          …

Branch Instructions and Condition Codes
A branch instruction transfers control to another part of the code
  ○ Like a goto in the C language
The PC register is not incremented as usual, but set to the computed address of an instruction
  ○ Corresponds to the value of its label
An unconditional branch is always taken
  ○ Form: b label
      e.g. b top
Condition flags store information about the result of an instruction
  ○ Single-bit units in the CPU (AArch64 has no single status register; the flags are part of PSTATE)
  ○ Record process state (PSTATE) information
  ○ Four flags:
      Z: true if the result is 0
      N: true if the result is negative
      V: true if the result overflows
      C: true if the result generates a carry out
  ○ Condition flags are set by instructions that end in "s" (short for set flags)
  ○ E.g. subs, adds
  ○ subs may be used to compare two registers
  ○ E.g.: subs x0, x1, x2 (performs the subtraction and also sets the flags)
  ○ cmp is more intuitive
      64 bit: cmp Xn, Xm
      Alias for: subs xzr, Xn, Xm
      e.g.: cmp x1, x2
  ○ Conditional branch instructions use those flags to make a decision
      If the tested condition is true, we take the branch
        i.e. "jump" to the instruction at the specified label
      Otherwise, control goes to the next instruction in the program
      e.g.: b.eq top // jump to the line labeled top; branches if Z is true
      Form: b.cc label, where cc is the condition code
For signed integers:
    Name  Meaning                   C equivalent  Flags                 Complement
    eq    equal                     ==            Z == 1                ne
    ne    not equal                 !=            Z == 0                eq
    gt    greater than              >             Z == 0 && N == V      le
    ge    greater than or equal     >=            N == V                lt
    lt    less than                 <             N != V                ge
    le    less than or equal to     <=            !(Z == 0 && N == V)   gt
[…]
E.g.:
  C:
    if (a > b) {
        // do stuff
    }
  →
    define(a_r, x19)
    define(b_r, x20)
    define(c_r, x21)
    define(d_r, x22)
          cmp a_r, b_r
          b.le skip
    if:   // Do stuff
          // Do stuff
    skip: // Do other stuff

The if-else Construct
Is formed by branching to the else part if the condition is not true
  ○ Use the logical complement
  ○ If true, the code falls through to the if part
E.g.:
  C:
    if (a > b) {
        c = a+b;
        d = c+5;
    } else {
        c = a-b;
        d = c-5;
    }
  Assembly:
    define(a_r, x19)
    define(b_r, x20)
    define(c_r, x21)
    define(d_r, x22)
          cmp a_r, b_r
          b.le else
          add c_r, a_r, b_r
          add d_r, c_r, 5
          b next
    else: sub c_r, a_r, b_r
          sub d_r, c_r, 5
    next: …

Introduction to the GDB Debugger
To start a program under debugger control:
  ○ gdb followed by the name of the executable
To set a breakpoint:
  ○ b followed by a label
To run the program:
  ○ r
  ○ Stops at the first breakpoint
Use c to continue to the next breakpoint
Use si to single step through the program
Use ni to proceed through the entire next instruction
  ○ for example, printf is a function call, so rather than stepping into it, proceed
    through all of the instructions that are part of that function
Use display/i $pc to automatically show the current instruction
  ○ Displays the instruction at the program counter
Use p $ followed by the register name to print a register
  ○ Can append a format character:
      signed decimal: p/d
      hexadecimal: p/x
      binary: p/t (the letter b is used somewhere else)
Use q to quit gdb

Binary Numbers and Integer Representations
Binary numbers
  Base 2 numbers
  Use the binary digits 0 and 1
  Easy to encode on a computer, since only two states need to be distinguished
    ○ Using voltages:
        0 = 0 V
        1 = 3.3 V on modern machines, used to be 5 V on older machines
  Using physical media, 0 typically represents the default/unaltered/off state, whereas 1 represents the acted upon/altered/on state
  An n-bit register can hold 2^n bit patterns
    ○ A 4-bit register can hold 16 distinct bit patterns
Unsigned Integers
  Encoded using simple binary numbers
  Range: 0 to 2^n - 1, where n is the number of bits
Signed Integers
  Most commonly encoded using the two's complement representation
  Range: -2^(n-1) to 2^(n-1) - 1
    ○ 4 bits: -8 to 7
    ○ 8 bits: -128 to 127
    ○ 32 bits: -2,147,483,648 to 2,147,483,647
    ○ 64 bits: -2^63 to 2^63 - 1
  Negating a number is done by:
    ○ taking the one's complement (flip 0s to 1s and 1s to 0s)
    ○ adding 1 to the result
  Find the bit pattern for -5 in a 4-bit register
    ○ 5 = 0101
    ○ → 1010 → 1011 = -5
    ○ In the other direction:
    ○ 1011 → 0100 → 0101 = 5
  All non-negative numbers have a 0 in the leftmost bit, all negative numbers have a 1 in the leftmost bit
    ○ This is the sign bit
  Sign-magnitude and one's complement representations are also possible
    ○ However, they are awkward to handle in hardware
    ○ They have two zeros (+0, -0)
Hexadecimal numbers
  Base 16
  Use 0, 1, 2, …, 9, A, B, C, D, E, F
  Shorthand for denoting bit patterns
    ○ 0xF5A = 1111 0101 1010
Octal numbers
  Base 8
  Use the digits 0-7
  Each digit corresponds to a 3-bit pattern
  e.g.
  0756 → 111 101 110

Integer classes and subtypes
Linux on ARMv8 in AArch64 uses the LP64 data model
  ○ Long integers and Pointers are always 64 bits long
    A64 keyword   Size in bits   C keyword
    byte          8              char (misleading -- you can also store ints and uints in here)
    halfword      16             short int
    word          32             int
    doubleword    64             long int, void * (any kind of pointer)
    quadword      128            n/a
In C, use the keyword unsigned to denote unsigned integers
  ○ e.g. unsigned int x; // 32 bits
  ○ e.g. unsigned char y; // 8 bits

Bitwise Operations
Bitwise Logical Instructions
Manipulate bits in a register
    a   b   a AND b
    0   0   0
    0   1   0
    1   0   0
    1   1   1
Form (64-bit):
  ○ and Xd, Xn, Xm
      e.g.
        mov x19, 0xAA       // 1010 1010
        mov x20, 0xF0       // 1111 0000
        and x21, x19, x20   // 1010 0000
  ○ Note that 0xF0 forms a bitmask: it lets certain bits through
  ○ 32-bit and immediate versions also exist (using w registers)
      e.g.
        mov w19, 0x55       // 0101 0101
        and w19, w19, 0xF   // mask 0000 1111 → result 0000 0101
ands sets or clears the N and Z flags according to the result. V and C are always clear.
  ○ e.g. test if bit 3 is set in x20
      Bit 3 is the fourth bit from the right. Is there a 0 or 1 in there?
      Bitmask: 1000 = 0x8
                  …
                  ands x19, x20, 0x8
                  b.eq bitclear      // if the result equals 0, jump to bitclear
        bitset:   …
        bitclear: ...

Missing notes!!! I was away for a couple of weeks so the notes are incomplete here. Make sure to read the binary logic and memory and stack slides.

Basic Load and Store Instructions, cont.
Store Byte
  ○ Form (32-bit only): strb Wt, addr
  ○ Stores the low-order byte in Wt to RAM
Store Halfword
  ○ Form (32-bit only): strh Wt, addr
  ○ Stores the low-order halfword (2 bytes) in Wt into RAM

Load/Store Address Modes
Pre-indexed by immediate offset
  ○ Form: [base, #imm]!
      base: an X register (typically x29, the frame pointer) or SP
      #imm: a 9-bit signed constant (range: -256 to 255)
  ○ base is first updated by adding the immediate to it
      This address is then used for the load/store
  ○ Example:
        …
        add x28, x29, 16    // x28 = frame pointer + 16
        str w20, [x28, 8]!
        …
Post-indexed by immediate offset
  ○ Form: [base], #imm
      base: an X register (typically x29) or SP
      #imm: a 9-bit signed constant (range: -256 to 255)
  ○ The address in base is used for the load/store
      base is updated afterwards
  ○ E.g.
        …
        add x28, x29, 16
        str w20, [x28], 8

Stack Variable Offset Macros
m4 macros can be used for offsets to improve readability
E.g.
    define(a_s, 16)    // When using stack memory, we append _s to the name
    define(b_s, 20)
    define(c_s, 24)
    define(d_s, 28)
    …
    str w20, [x29, a_s]
    ldr w21, [x29, b_s]
Can also be done with assembler equates (not related to m4)
e.g.:
    a_s = 16
    b_s = 20
    …
    str w20, [x29, a_s]
    ldr w21, [x29, b_s]
Register equates are useful for renaming x29 (FP) and x30 (LR, which we haven't used much).
  ○ e.g.
        fp .req x29    // register equate fp to x29
        lr .req x30    // register equate lr to x30
        …
        str w20, [fp, a_s]

Local Variables
In C, can be declared in a block of code, i.e. in any construct delimited by {...}
e.g.
    int main() {
        int a = 5, b = 7;    // local to main
        …
        if (a < b) {
            int c;           // local to if-construct
        }
    }
Implemented in assembly as stack variables
  ○ Those local to the function are allocated upon entering the function, as shown previously
  ○ Those local to other blocks of code are:
      Allocated when entering the block
        Done by decrementing the SP to allocate memory on the top of the stack
      Read or written using FP plus a negative offset
      Deallocated when leaving the block
        Done by incrementing SP
  ○ E.g. Equivalent assembly code
        a_s = 16
        b_s = 20
        c_s = -4
        …
        main: stp x29, x30, [sp, -(16 + 8) & -16]!
              // 16: frame record
              // 8: two variables, a and b, which are regular 4-byte integers
              // Add those numbers and negate as usual
              // & with -16 rounds the allocation to a multiple of 16 bytes
              mov x29, sp
              mov w19, 5                // init a to 5
              str w19, [x29, a_s]
              mov w20, 7                // init b to 7
              str w20, [x29, b_s]
              cmp w19, w20
              b.ge next
              // start of block of code for the if construct
              add sp, sp, -4 & -16      // alloc RAM for c
                                        // bitwise & with -16 to align properly
              mov w21, 10               // init c to 10
              str w21, [x29, c_s]
              …
              // end of block of code
              add sp, sp, 16
        next: …
After allocating RAM for c, the stack appears as:
              Free memory (many bytes)
    SP →      Pad bytes (12B)       |
              c (4B)                |
    FP →      Frame record (16B)    |  Stack frame for main (when we are in the
              a (4B)                |  middle of the if construct, after allocating
              b (4B)                |  memory for c)
              Pad bytes (8B)        |
Note that, as shown in this visual, we always allocate in groups of 16B
Readings:
  P & H: Section 2.3, p. 166 - 168
  ARMv8 Instruction Set Overview:
    ○ Section 4.5
    ○ Section 5.3.1

Data Structures
One-Dimensional Arrays
Must be done manually
Store consecutive array elements of the same type in a contiguous block of memory
  ○ Block size = number of elements * element size
  ○ Address of element i is: base address + (i * element size)
E.g.: int ia[5]; // 1D int array with size 5
                 // Each int is 4 B long, so we need 20 B
Like other local variables, an array is allocated in the function's stack frame
e.g. C code:
    int main() {
        int a, b, ia[5];
        …
    }
Assembly code:
    a_size = 4         // a is 4 bytes
    b_size = 4
    ia_size = 5 * 4    // 5 elements in the array, each is 4 bytes long
    alloc = -(16 + a_size + b_size + ia_size) & -16
    // We always need 16 bytes for the frame record. Add the size of each
    // variable to that. Negate the result of the additions. Use & -16 to align to 16
    // bytes.
    dealloc = -alloc
    a_s = 16           // offset is 16 bytes for a because of the 16 bytes for the frame record
    b_s = 20           // offset is 20 bytes
    ia_s = 24
    main: stp x29, x30, [sp, alloc]!
          mov x29, sp
          …
          ldp x29, x30, [sp], dealloc
          ret
Memory is used as follows:
    Address        Item
                   Free memory
    SP → FP →      Frame record (16B)   |
    fp+a_s         a (4B)               |
    fp+b_s         b (4B)               |
    fp+ia_s+0      ia[0] (4B)           |  Stack frame for main
    fp+ia_s+4      ia[1] (4B)           |
    …              …                    |
    fp+ia_s+16     ia[4] (4B)           |
                   pad bytes            |  // because of "& -16"
Array elements are accessed using load and store instructions
E.g.: ia[2] = 13;
    define(ia_base_r, x19)    // where the array starts in memory
    define(index_r, x20)
    define(offset_r, x21)
    …
    add ia_base_r, x29, ia_s          // calc array base address
    mov index_r, 2                    // set index to 2
    lsl offset_r, index_r, 2          // offset = index * 4
    mov w22, 13
    str w22, [ia_base_r, offset_r]    // ia[2] = 13
Can be optimised by using the LSL in the STR instruction:
    mov index_r, 2
    mov w22, 13
    str w22, [ia_base_r, index_r, LSL 2]    // ia[2] = 13
If the index is in a w register, one must sign extend first:
    define(index_r, w20)
    …
    str w22, [ia_base_r, index_r, SXTW 2]   // ia[2] = 13; SXTW sign extends the w register

Calling random
    bl rand
  ○ Result appears in w0
  ○ Pseudo-random number generator; the sequence is always the same when running the program

Multidimensional Arrays
Use 2 or more indices to access individual array elements.
Most languages use row major order when storing arrays in RAM
  ○ i.e. first all elements of row 0, then row 1, etc.
e.g. int ia[2][3], a 24-byte block:
    Address   Memory
    ia+0      ia[0][0]
    ia+4      ia[0][1]
    ia+8      ia[0][2]
    ia+12     ia[1][0]
    ia+16     ia[1][1]
    ia+20     ia[1][2]
Block size for an n-dimensional array:
  ○ $$\text{block size} = \left(\prod_{r=1}^{n} dim_r\right) \times \text{element size}$$
  ○ where dim_r is the number of elements in dimension r
  ○ the ∏ is just ∑ for multiplication
  ○ e.g. an int array with dimensions 2, 3 and 3 has block size 2 * 3 * 3 * 4 bytes = 72 bytes
The offset for a given element is:
  ○ $$\text{offset} = \left[\sum_{r=1}^{n-1}\left(index_r \times \prod_{s=r+1}^{n} dim_s\right) + index_n\right] \times \text{element size}$$
    where index_r is the index for the element in dimension r
  ○ e.g.
    a 3-dimensional array would be declared with: type array[dim1][dim2][dim3]
    Its offset for element array[index1][index2][index3] is:
      ((index1 * dim2 * dim3) + (index2 * dim3) + index3) * element size
    For an int array ia with dim2 = 3 and dim3 = 4, the offset for ia[i][j][k] would be: ((i*3*4)+(j*4)+k)*4
      Six arithmetic operations
        ○ Makes high dimensional arrays slow
Eg: C code for a 2D array
    int main() {
        int ia[2][3];
        register int i, j;    // keyword in C; a hint to use a register for each, if possible,
                              // instead of allocating on the stack
        …
        ia[i][j] = 13;
        …
    }
Assembly:
    define(ia_base_r, x19)
    define(offset_r, w20)
    define(i_r, w21)
    define(j_r, w22)
    dim1 = 2
    dim2 = 3
    ia_size = dim1 * dim2 * 4
    alloc = -(16 + ia_size) & -16    // 16 for frame record, & -16 to guarantee alignment
    dealloc = -alloc
    ia_s = 16                        // ia comes immediately after the frame record
    …
    main: stp x29, x30, [sp, alloc]! // frame pointer, link register, stack pointer
          mov x29, sp
          …
          add ia_base_r, x29, ia_s   // array base address = fp + offset
          mov w23, dim2
          mul offset_r, i_r, w23     // offset = (i * dim2)
          add offset_r, offset_r, j_r    // offset = (i * dim2) + j
          lsl offset_r, offset_r, 2      // offset = ((i * dim2) + j) * 4
          mov w24, 13
          str w24, [ia_base_r, offset_r, sxtw]    // ia[i][j] = 13
          // can be optimised by using lsl in the str

Structures
Contain fields of different types
Allocated in a single block of memory on the stack
  ○ Fields are accessed using offsets
Eg:
    struct rec {    // structure tag (rec is short for record, but the name is unimportant)
        int a;
        char b;
        short c;
    };
    Memory          Offset
    a: 4 bytes      rec_a = 0    // rec comes from the structure tag
    b: 1 byte       rec_b = 4
    c: 2 bytes      rec_c = 5
If the base address is in x19, the fields are accessed with:
  ○ ldr w20, [x19, rec_a]
  ○ ldrsb w21, [x19, rec_b]    // ldr signed byte
  ○ ldrsh w22, [x19, rec_c]    // ldr signed halfword
Eg:
    struct rec {       // defines a custom data type (struct);
        int a;         // does not allocate memory, that is done in main
        char b;
        short c;
    };
    int main() {
        struct rec s1; // allocates memory for local variable s1
        …
        s1.a = 42;
        s1.b = 23;
        s1.c = 13;
        …
    }
Assembly equivalent:

    define(s1_base_r, x19)
    rec_a = 0
    rec_b = 4
    rec_c = 5
    s1_size = 7
    alloc = -(16 + s1_size) & -16
    dealloc = -alloc
    s1_s = 16                               // right after frame record
    …
    main:   stp x29, x30, [sp, alloc]!
            mov x29, sp
            …
            add s1_base_r, x29, s1_s        // calculate struct base address
            mov w20, 42
            str w20, [s1_base_r, rec_a]     // s1.a = 42
            mov w20, 23
            strb w20, [s1_base_r, rec_b]    // s1.b = 23
            mov w20, 13
            strh w20, [s1_base_r, rec_c]    // s1.c = 13

Nested Structures
A structure may contain a field whose type is another structure
○ The field's subfields are accessed using a second set of offsets.
E.g. C code (custom struct to record a date)

    struct date {
        unsigned char day, month;
        unsigned short year;
    };

    struct employee {
        int id;
        struct date start;      // nested struct
    };

    int main(){
        struct employee joe;
        joe.id = 4001;
        joe.start.day = 1;
        joe.start.month = 6;
        joe.start.year = 1999;
    }

    Offset  Nested offset   Memory      Variable
    0       -               4001 (4B)   joe.id
    4       0               1 (1B)      joe.start.day
    -       1               6 (1B)      joe.start.month
    -       2               1999 (2B)   joe.start.year

Assembly code:

    define(joe_base_r, x19)
    date_day = 0
    date_month = 1
    date_year = 2
    employee_id = 0
    employee_start = 4
    joe_size = 8
    alloc = -(16 + joe_size) & -16
    dealloc = -alloc
    joe_s = 16                              // joe comes right after the frame record

    main:   stp x29, x30, [sp, alloc]!
            mov x29, sp
            …
            add joe_base_r, x29, joe_s
            // id
            mov w20, 4001
            str w20, [joe_base_r, employee_id]                  // joe.id = 4001
            // day
            mov w20, 1
            strb w20, [joe_base_r, employee_start + date_day]   // joe.start.day = 1
            // month
            mov w20, 6
            strb w20, [joe_base_r, employee_start + date_month] // joe.start.month = 6
            // year
            mov w20, 1999
            strh w20, [joe_base_r, employee_start + date_year]  // joe.start.year = 1999

Note: The assembler evaluates the expression employee_start + date_year at assembly time, so the instruction contains a single constant offset at runtime.
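The nested offsets used in the assembly above can be cross-checked from C with the standard offsetof macro. This is only a minimal sketch, assuming an ABI such as AArch64 Linux where int is 4 bytes and short is 2 bytes. The enum constants reuse the names of the assembly equates, and start_year_offset is a hypothetical helper that does not appear in the notes.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof, size_t */

struct date {
    unsigned char day, month;
    unsigned short year;
};

struct employee {
    int id;
    struct date start;   /* nested struct */
};

/* Same names as the assembly equates, but computed by the compiler. */
enum {
    date_day       = offsetof(struct date, day),
    date_month     = offsetof(struct date, month),
    date_year      = offsetof(struct date, year),
    employee_id    = offsetof(struct employee, id),
    employee_start = offsetof(struct employee, start),
    joe_size       = sizeof(struct employee)
};

/* Assumption: int is 4 bytes and short is 2 bytes (e.g. AArch64 Linux). */
static_assert(employee_start == 4, "start follows the 4-byte id");
static_assert(date_year == 2, "year follows day and month");
static_assert(joe_size == 8, "4-byte id plus 4-byte date");

/* Hypothetical helper: offset of joe.start.year from the base of joe,
   i.e. the value of the expression employee_start + date_year. */
size_t start_year_offset(void) {
    return employee_start + date_year;
}
```

Under those assumptions start_year_offset() returns 6, the same constant the assembler folds into the strh instruction. If the compiler's layout differed from the hand-written equates, the static_assert lines would fail at compile time rather than silently producing wrong offsets.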
Subroutines
Basically functions
Subroutines allow you to repeat a calculation using varying argument values
Two types:
○ Open (inline)     // the code is inserted inline wherever the subroutine is invoked
    Usually done with a macro preprocessor
    Arguments are passed in/out using registers
    Efficient, since the overhead of branching and returning is avoided
    Suitable only for fairly short subroutines, because the code can grow large
○ Closed
    Machine code for the routine appears only once in RAM
    More compact machine code
    When invoked, control jumps to the first instruction of the routine
        i.e. the PC is loaded with the address of the first instruction
    When finished, returns control to the next instruction in the calling code
        i.e. the PC is loaded with the return address
    Arguments are placed in registers or on the stack
    Slower than inline routines
Subroutines should not change the state of the machine for the calling code.
○ When invoked, a subroutine should save any registers it uses on the stack.
○ When returning, it should restore those values
Arguments to subroutines are considered local variables
○ Subroutine may change their values

Open (Inline) Subroutines
Usually implemented using macros
E.g. cube function

    define(comment)
    comment(cube(1 = input register, 2 = output register))
    define(cube, `mul $2, $1, $1
            mul $2, $1, $2')

            .global main
    main:   stp x29, x30, [sp, -16]!
            …
            mov x19, 8
            cube(x19, x20)          // x19 = input, x20 = output
            …

Note: m4 quotes open with the ` quotation mark (top left of keyboard) and close with '.
Macro expands this to:

            …
            mov x19, 8
            mul x20, x19, x19
            mul x20, x19, x20

General form of a closed subroutine:

    label:  stp x29, x30, [sp, alloc]!
            mov x29, sp
            …
            ldp x29, x30, [sp], dealloc
            ret

label: names the subroutine
alloc: number of bytes (negated) to allocate for the subroutine's stack frame
○ SP must be quad-word aligned
○ Minimum of 16 bytes
    For the frame record in the stack frame

Subroutine Linkage
May be invoked using the branch and link instruction: bl
○ form: bl subroutine_label
○ Stores the return address in the link register, x30
    Return address is PC+4, the next instruction after the bl instruction
Use the ret instruction to return from a subroutine to the calling code
E.g. C code:

    int main() {
        …
        func1();
        …
    }

    void func1(){
        …
        func2();
        …
    }

    void func2(){
        …
    }

Assembly code:

    main:   stp x29, x30, [sp, -16]!
            mov x29, sp
            …
            bl func1
            …
            ldp x29, x30, [sp], 16
            ret

    func1:  stp x29, x30, [sp, -16]!
            mov x29, sp
            …
            bl func2
            …
            ldp x29, x30, [sp], 16
            ret

    func2:  stp x29, x30, [sp, -16]!
            mov x29, sp
            …
            ldp x29, x30, [sp], 16
            ret

16 bytes are for the frame record, which holds the previous frame pointer and the link register.
The stp instructions create a frame record in each function's stack frame.
○ The LR (x30) is stored in case it is changed by a bl in the body of the function
○ It is restored by the load pair (ldp) instruction
The FP and the stored FP values in the frame records form a linked list
○ The stack while func2 is executing looks like:
    [stack diagram not recovered; only its top label, "Free memory", survives]

Saving and Restoring Registers
A called function must save/restore the state of the calling code
○ If it uses any registers in x19 - x28, it must save their values to the stack at the beginning of the function
    These are called "callee-saved registers"
○ The function must restore the data in these registers before returning
E.g.

    x19_size = 8
    alloc = -(16 + x19_size) & -16
    dealloc = -alloc
    x19_save = 16                       // offset

    func2:  stp x29, x30, [sp, alloc]!
            mov x29, sp
            str x19, [x29, x19_save]    // save x19
            …
            mov x19, 13
            …
            ldr x19, [x29, x19_save]    // restore x19
            ldp x29, x30, [sp], dealloc
            ret

Note that the callee can also use registers x9 - x15
○ By convention, these registers are not saved/restored by the called function
○ Only safe to use in calling code between function calls
○ The calling code can save these registers to the stack, if it is necessary to preserve their values over a function call
    These are called "caller-saved registers"

Arguments to Subroutines
8 or fewer arguments can be passed into a function using the registers x0 - x7
○ ints, short ints, and chars use w0 - w7
○ long ints use x0 - x7
E.g. C code

    void sum(int a, int b) {
        register int i;
        i = a + b;
        …
    }

    int main(){
        sum(3, 4);
        …
    }

Assembly code

    define(i_r, w9)

    // Subroutine regards x0-x7 as local
    sum:    stp x29, x30, [sp, -16]!
            mov x29, sp
            add i_r, w0, w1             // w0 is first arg, w1 is second arg
            ldp x29, x30, [sp], 16
            ret

    main:   stp x29, x30, [sp, -16]!
            mov x29, sp
            mov w0, 3                   // set up first arg
            mov w1, 4                   // set up second arg
            bl sum
            …
            ldp x29, x30, [sp], 16
            ret

Multiplication 5*3: M = 5, T = 0, N = 3

    iteration   M       T       N
    0           0101    0000    0011
    1           0101    0101    0011
                0101    0010    1001
    2           0101    0111    1001
                0101    0011    1100
    3           0101    0011    1100
                0101    0001    1110
    4           0101    0001    1110
                0101    0000    1111

    last step: 00001111 = 15 = 5*3

Multiplication 2*(-5): M = 2, T = 0, N = -5

    iteration   M       T       N
    0           0010    0000    1011
    1           0010    0010    1011
                0010    0001    0101
    2           0010    0011    0101
                0010    0001    1010
    3           0010    0001    1010
                0010    0000    1101
    4           0010    0010    1101
                0010    0001    0110

    Since the original multiplier is negative, subtract the multiplicand from the high half of the product, i.e. T = T - M:
    0001 - 0010 = 0001 + 1110 = 1111
    result: 11110110 = -00001010 = -10

Pointer Arguments
In calling code, the address of a variable is passed to the subroutine
○ Implies that the variable must be in RAM, not a register
The called subroutine dereferences the address, to manipulate the variable being pointed to
○ Usually with a ldr or str instruction
E.g.
C code

    int main(){
        int a = 5, b = 7;
        swap(&a, &b);           // "address of" operator
        …
    }

    void swap(int *x, int *y){  // x and y are pointers to some integer
        register int temp;
        temp = *x;              // de-referencing
        *x = *y;
        *y = temp;
    }

Assembly code:

    a_size = 4
    b_size = 4
    alloc = -(16 + a_size + b_size) & -16
    dealloc = -alloc
    a_s = 16
    b_s = 20
    define(temp_r, w9)
    …
    main:   stp x29, x30, [sp, alloc]!
            mov x29, sp
            mov w19, 5                  // init a = 5
            str w19, [x29, a_s]
            mov w20, 7                  // init b = 7
            str w20, [x29, b_s]
            add x0, x29, a_s            // set up first arg (address of a)
            add x1, x29, b_s            // set up second arg (address of b)
            bl swap
            …

    swap:   stp x29, x30, [sp, -16]!
            mov x29, sp
            ldr temp_r, [x0]            // temp = *x
            ldr w10, [x1]               // w10 = *y
            str w10, [x0]               // *x = w10
            str temp_r, [x1]            // *y = temp
            ldp x29, x30, [sp], 16
            ret

Returning Integers
A function returns:
○ long ints in x0
○ ints, short ints, chars in w0
E.g. Cube function in C

    int cube(int x);            // function prototype

    int main(){
        register int result;
        result = cube(3);
    }

    int cube(int x){
        return x * x * x;
    }

Assembly code:

    define(result_r, w19)

    main:   stp x29, x30, [sp, -16]!
            mov x29, sp
            mov w0, 3                   // first arg
            bl cube
            mov result_r, w0            // copy the returned result from w0 into result_r
            ldp x29, x30, [sp], 16
            ret

    cube:   stp x29, x30, [sp, -16]!
            mov x29, sp
            mul w9, w0, w0              // w9 = x*x
            mul w0, w9, w0              // w0 = w9 * w0 = x*x*x
            ldp x29, x30, [sp], 16
            ret

Returning Structures
In C, a function may return a struct by value
E.g.:

    struct mystruct{
        long int i;
        long int j;
    };

    struct mystruct init(){     // function returns a struct of type "mystruct"
        struct mystruct lvar;   // lvar is a local variable
        lvar.i = 0;
        lvar.j = 0;
        return lvar;
    }

    int main(){
        struct mystruct b;
        b = init();
        …
    }

Usually, a structure is too large to return in x0 or w0
○ So, we return it in memory
○ Calling code provides memory on the stack to store the returned result
    The address of this memory is put into x8 prior to the function call
    x8 is called the "indirect result location register".
○ The called subroutine writes to memory at this address, using x8 as a pointer to it
Example:

    mystruct_i = 0
    mystruct_j = 8
    b_size = 16
    alloc = -(16 + b_size) & -16
    dealloc = -alloc
    b_s = 16

    main:   stp x29, x30, [sp, alloc]!
            mov x29, sp
            add x8, x29, b_s            // calculate address of b
            bl init
            ldp x29, x30, [sp], dealloc
            ret

    define(lvar_base_r, x9)
    lvar_size = 16
    alloc = -(16 + lvar_size) & -16
    dealloc = -alloc
    lvar_s = 16

    init:   stp x29, x30, [sp, alloc]!
            mov x29, sp
            // Calculate lvar struct base address
            add lvar_base_r, x29, lvar_s
            str xzr, [lvar_base_r, mystruct_i]      // lvar.i = 0
            str xzr, [lvar_base_r, mystruct_j]      // lvar.j = 0
            ldr x10, [lvar_base_r, mystruct_i]
            str x10, [x8, mystruct_i]               // b.i = lvar.i (x8 was set in main)
            ldr x10, [lvar_base_r, mystruct_j]
            str x10, [x8, mystruct_j]               // b.j = lvar.j
            ldp x29, x30, [sp], dealloc
            ret

Optimising Leaf Subroutines
Leaf subroutines do not call other subroutines
○ If there is no bl in a subroutine, it is a leaf
○ Leaf subroutines are leaf nodes on a structure diagram
A frame record is not pushed onto the stack
○ Since the routine does not do a bl, the LR won't change
○ Since the routine does not call a subroutine, the FP won't change
○ Thus we can eliminate the usual stp/ldp instructions
If one only uses the registers x0-x7 and x9-x15, a stack frame is not pushed at all
○ No need to save/restore registers
E.g. optimised cube function

    cube:   mul w9, w0, w0              // w9 = x^2
            mul w0, w9, w0              // w0 = x^3
            ret

Subroutines with > 8 Arguments
Arguments beyond the 8th are passed on the stack
The calling code allocates memory at the top of the stack, then writes the "spilled" arguments there
○ By convention, each argument is allocated 8 bytes no matter its size.
    Similar to how registers are always 8 bytes even if used for ints
○ Callee reads this memory using the appropriate offset
E.g.
C code:

    int main() {
        register int result;
        result = sum(10, 20, 30, 40, 50, 60, 70, 80, 90, 100);
    }

    int sum(int a1, int a2, int a3, int a4, int a5, int a6, int a7, int a8, int a9, int a10) {
        return a1+a2+a3+a4+a5+a6+a7+a8+a9+a10;
    }

Assembly code:

    define(result_r, w19)
    spilled_mem_size = 16
    alloc = -spilled_mem_size & -16
    dealloc = -alloc

            .global main
    main:   stp x29, x30, [sp, -16]!
            mov x29, sp
            mov w0, 10
            mov w1, 20
            mov w2, 30
            mov w3, 40
            mov w4, 50
            mov w5, 60
            mov w6, 70
            mov w7, 80
            // Allocate memory for args 9 and 10
            add sp, sp, alloc
            // Write spilled arguments to the top of the stack
            mov w9, 90
            str w9, [sp, 0]
            mov w9, 100
            str w9, [sp, 8]
            // Call sum function
            bl sum
            mov result_r, w0
            // Deallocate memory for spilled args
            add sp, sp, dealloc
            ldp x29, x30, [sp], 16
            ret

    arg9_s = 16                         // starts at 16 because of the frame record
    arg10_s = 24

    sum:    stp x29, x30, [sp, -16]!
            mov x29, sp
            add w0, w0, w1
            add w0, w0, w2
            add w0, w0, w3
            add w0, w0, w4
            add w0, w0, w5
            add w0, w0, w6
            add w0, w0, w7
            // Read the 9th and 10th args
            ldr w9, [sp, arg9_s]
            add w0, w0, w9
            ldr w9, [sp, arg10_s]
            add w0, w0, w9
            ldp x29, x30, [sp], 16
            ret

Variables
Local and static local
In C, local (automatic) variables are always allocated in the stack frame for a function
○ Scope: local to the block of code where declared
○ Lifetime: life of the block of code
Static local variables are different
○ Scope: block of code where declared
○ Lifetime: life of the program
    They persist from call to call of the function
E.g.
C code:

    int count(){
        static int value = 0;
        return ++value;
    }

Cannot be stored in the stack frame for the function
Stored in a separate section of RAM

Global
Scope: global (from declaration onwards)
E.g.:

    int val;        // global variable, not inside any function

    main(){
        val = 3;
        …
    }

    int f(){
        int a;
        a = val;
    }

Stored in a separate section of memory

Static global
Scope: local to file (from declaration onwards)
Lifetime: life of program
Stored in a separate section of RAM

text, data, and bss Sections
Programs may allocate 3 sections of memory:
○ text
    Contains:
        Program text (machine code)
        Read-only, programmer-initialised data
    Is read-only memory
        Attempts to write to this memory cause a segmentation fault
○ data
    Read/write memory
○ bss (block started by symbol)
    Contains zero-initialised data (e.g. i = 0)
    Is read/write memory
These sections are located in low memory, just after the section reserved for the OS kernel.

    Low
        OS
        text            ← PC
        data
        bss
        heap            v
        free memory
        stack           ^   ← SP
                            ← FP
    High

Pseudo-ops indicate that what follows goes in a particular section of memory
○ .text
    Default section when assembling
○ .data
○ .bss
You can alternate between them
Assembler uses a location counter for each section
○ Starts at 0, increases as instructions and data are processed
○ Final step of assembly gathers all code and data into the appropriate sections
When the OS loads the program into RAM:
○ text, data loaded first
○ bss zeroed

External variables
Non-local variables, allocated in data or bss
○ Used to implement global and static local variables
Can be allocated and initialised using the pseudo-ops:
○ .dword (8 bytes)
○ .word (4 bytes)
○ .hword (2 bytes)
○ .byte (1 byte)
General form:
○ label: pseudo-op value[, value2, …]
E.g.:

            .data
    a_m:    .hword 23           // _m for things in memory (data or bss)
    b_m:    .word (11*4) - 2
    c_m:    .dword 0
    arr_m:  .byte 10, 20, 30

Allocates 17 bytes in the data section (2 + 4 + 8 + 3).
The labels represent 64-bit addresses
○ Use adrp and add to put the address into a register
○ Then use ldr or str to access the variable
○ E.g.
C code:

    int i = 2, j = 12, k = 0;

    int main() {
        k = i + j;
        …
    }

○ Assembly code:

            .data
    i_m:    .word 2                     // i (var in C code) _m (using memory)
    j_m:    .word 12
    k_m:    .word 0

            .text
            .balign 4
            .global main
    main:   stp x29, x30, [sp, -16]!
            mov x29, sp
            adrp x19, i_m
            add x19, x19, :lo12:i_m
            ldr w20, [x19]              // w20 = i
            adrp x19, j_m
            add x19, x19, :lo12:j_m
            ldr w21, [x19]              // w21 = j
            add w22, w20, w21           // w22 = i+j
            adrp x19, k_m
            add x19, x19, :lo12:k_m
            str w22, [x19]              // k = w22

.skip
Uninitialized space can be allocated with the .skip pseudo-op
○ Eg: 10 element int array
    myarray: .skip 10*4
Use .global if the variable is to be made available to other compilation units
○ Eg: .global myvar_m
The bss section usually only uses the .skip pseudo-op
○ All bss memory is zeroed before the program executes
    Initializing memory to non-zero values (.word, .hword, etc.) doesn't make sense
○ Eg:

            .bss
    array_m: .skip 10*4                 // int array
    c_m:    .skip 1                     // char
    h_m:    .skip 2                     // short int (halfword)

Programmer-initialised constants are put into the .text section
○ Before or between functions; not in the body
○ Eg:

            .text
            .balign 4
    func:   stp …
            …
            ret
    con_m:  .hword 42                   // constant in memory; cannot be overwritten
            .balign 4
    func2:  stp …
            …
            ret

ASCII Character Set
American Standard Code for Information Interchange
○ Encodes characters using 7 bits, stored in a byte
○ www.lookuptables.com
○ A: 0x41
○ Z: 0x5A
○ a: 0x61
○ z: 0x7A
○ 0: 0x30
○ 9: 0x39
○ Space: 0x20
○ Tab: 0x9
○ Period: 0x2E
In assembly, character constants can be denoted with:
○ hex code
    Eg: mov w19, 0x5A   // Z
○ the character in single quotes
    Eg: mov w19, 'Z'    // stores 0x5A
    Note: the quotes may interfere with m4
○ In gdb, use p/c on a w register to print it as a character (e.g. p/c $w19)

Creating and addressing string literals
A string is an array of characters
Could be initialised in memory one byte at a time
○ e.g. "cheers"
    .byte 'c', 'h', 'e', 'e', 'r', 's'
Or use the .ascii pseudo-op
    .ascii "cheers"
In C, strings are null terminated
○ i.e.
    c h e e r s \0
○ Could be done using two pseudo-ops:
    .ascii "cheers"
    .byte 0
○ Or more conveniently with .asciz or .string:
    .asciz "cheers"
A string literal is a read-only array of characters, allocated in the text section
○ In C code, delimited with ""
○ Eg:

    int main() {
        printf("Hello, world!\n");
    }

○ The literal usually has a label, which represents the address of the first character in the array

            .text
    fmt:    .string "Hello, world!\n"   // string literal
            .balign 4
            .global main
    main:   stp x29, x30, [sp, -16]!
            mov x29, sp
            adrp x0, fmt                // address of first char 'H'
            add x0, x0, :lo12:fmt       // now in x0
            bl printf

(Notes missing here)

External Array of Pointers
Created with a list of labels
E.g. C code:

    #include <stdio.h>

    // Array of pointers to string literals
    char *season[] = {"spring", "summer", "fall", "winter"};

    int main(){
        register int i;
        for (i = 0; i < 4; i++){
            printf("season[%d] = %s\n", i, season[i]);
        }
        return 0;
    }

Equivalent assembly code

    define(i_r, w19)
    define(base_r, x20)

            .text
    fmt:    .string "season[%d] = %s\n"
    spr_m:  .string "spring"
    sum_m:  .string "summer"
    fal_m:  .string "fall"
    win_m:  .string "winter"

            .data                       // create array of pointers in data section
            .balign 8                   // double-word aligned;
                                        // not strictly necessary in armv8
    season_m: .dword spr_m, sum_m, fal_m, win_m

            .text
            .balign 4
            .global main
    main:   stp x29, x30, [sp, -16]!
            mov x29, sp
            mov i_r, 0
            b test
    top:    adrp x0, fmt
            add x0, x0, :lo12:fmt               // set up 1st arg
            mov w1, i_r                         // set up 2nd arg
            adrp base_r, season_m               // calc array base address
            add base_r, base_r, :lo12:season_m
            ldr x2, [base_r, i_r, sxtw 3]       // set up 3rd arg
            bl printf
            add i_r, i_r, 1
    test:   cmp i_r, 4
            b.lt top
            ldp x29, x30, [sp], 16
            ret

Command Line Arguments
Allow you to pass values from the shell into your program
In C: main(int argc, char *argv[])
○ argc: argument count (number of arguments)
○ argv: argument vector (array of pointers to the arguments)
Eg C code: myecho.c

    int main(int argc, char *argv[]){
        register int i;
        for (i = 0; i < argc; i++){
            printf("%s\n", argv[i]);
        }
        return 0;
    }

Sample run

    prompt> ./myecho one two
    ./myecho
    one
    two

Assembly code

    // Note: argc is in w0 and argv is in x1 for main()
    define(i_r, w19)
    define(argc_r, w20)
    define(argv_r, x21)

    fmt:    .string "%s\n"
            .balign 4
            .global main
    main:   stp x29, x30, [sp, -16]!
            mov x29, sp
            mov argc_r, w0                      // copy argc to safe register
            mov argv_r, x1                      // copy argv to safe register
            mov i_r, 0
            b test
    top:    adrp x0, fmt
            add x0, x0, :lo12:fmt               // set up 1st arg
            ldr x1, [argv_r, i_r, sxtw 3]       // set up 2nd arg
            bl printf
            add i_r, i_r, 1
    test:   cmp i_r, argc_r
            b.lt top
            ldp x29, x30, [sp], 16
            ret

Separate Compilation and Linking
Using two different assembly code files
Assembly code: first.s

            .balign 4
            .global main
    main:   stp x29, x30, [sp, -16]!
            mov x29, sp
            adrp x19, a_m
            add x19, x19, :lo12:a_m             // reference a_m from second.s
            ldr w0, [x19]
            bl myfunc                           // reference myfunc from second.s
            ldp x29, x30, [sp], 16
            ret

Assembly code: second.s

            .global a_m
    a_m:    .word 44                            // global constant
            .balign 4
            .global myfunc                      // global function
    myfunc: stp x29, x30, [sp, -16]!
            mov x29, sp
            sub w0, w0, 1
            ldp x29, x30, [sp], 16
            ret

To assemble and link:

    as first.s -o first.o               // invoke the assembler
    as second.s -o second.o
    gcc first.o second.o -o myexec

To view in gdb:

    (gdb) x/8i main
    (gdb) x/wd a_m                      // w: read word, d: show as decimal
    (gdb) x/5i myfunc

Source code is often divided into several modules
○ i.e. separate .c or .s files
○ Makes development of large projects easier
Modules can be compiled into relocatable object code
○ Will be put into corresponding .o files
○ Eg: gcc -c myfile1.c   // produces myfile1.o
Object code is linked together to create an executable
○ Is done by ld (the linker), usually invoked by gcc as the final step in the compilation process
○ Eg: gcc myfile1.o myfile2.o -o myexec
○ ld resolves any references to external data or functions in other modules
○ Code and data are relocated in memory as necessary to form contiguous text, data, and bss sections
…
The assemble-and-link steps shown above can be done more easily using a makefile
C code can call functions written in Assembly
○ For optimization of sections of code
Eg: C code: mymain.c

    #include <stdio.h>

    int sum(int, int);

    int main(){
        int i = 5, j = 10, result;
        result = sum(i, j);             // defined in sum.s
        printf("result = %d\n", result);
        return 0;
    }

Asm code: sum.s

            .balign 4
            .global sum
    sum:    stp x29, x30, [sp, -16]!
            mov x29, sp
            add w0, w0, w1
            ldp x29, x30, [sp], 16
            ret

To compile and link:

    gcc -c mymain.c                     // produces mymain.o in same directory
    as sum.s -o sum.o                   // can be done before the first gcc command
    gcc mymain.o sum.o -o myprog        // links the final executable
    ./myprog

Input and Output
A computer may do input and output in several ways
○ Interrupt-driven I/O
    Will be covered in CPSC 359
○ Memory-mapped I/O
○ Port I/O
○ System I/O
Our ARM servers are running a Linux OS, therefore only system I/O is available
○ User-level programs communicate with external devices through the OS
    Prevents malicious or accidental interactions that may damage the device
    e.g. a system call request:
        mov x8, 57      // service request number (57 is close on AArch64 Linux)
        svc 0           // call system function
Arguments to system calls are put in x0 - x5
Any return value is in x0
In Unix (Linux), all peripheral devices are represented as files
○ They present a uniform interface for I/O
Typical pattern:
○ Open file
    i.e. connect to device, get its file descriptor
○ Read from or write to file
    i.e. do device I/O, transferring bytes
○ Close the file
File I/O involves interacting with secondary memory
Standard I/O involves mouse, keyboard, display devices
Opening a file:
○ C code:
    int fd = openat(int dirfd, const char *pathname, int flags, mode_t mode);
    dirfd: directory file descriptor (used if path is relative)
        Can be set to AT_FDCWD (value -100), to indicate that the pathname is relative to the program's current working directory
    pathname: relative or absolute pathname to a file
    flags: combination of constants indicating what will be done to the file's data

        O_RDONLY    00      Read-only access
        O_WRONLY    01      Write-only access
        O_RDWR      02      Read/write access
        Optional:
        O_CREAT     0100    Create file if it does not exist
        O_EXCL      0200    Fail if file exists (with O_CREAT)
        O_TRUNC     01000   Truncate an existing file
        O_APPEND    02000   Append access

    mode: optional argument that specifies UNIX file permissions
        Required when creating a new file with specific permissions
        Specified in octal e.g.
        0700 specifies read/write/exec permission for the file owner, none for group or others
    fd: the returned file descriptor
        -1 on error
e.g. Assembly code:

    pn:     .string "myfile.bin"
            …
            mov w0, -100                // first arg: use relative to cwd
            adrp x1, pn                 // second arg: pathname
            add x1, x1, :lo12:pn
            mov w2, 0                   // third arg: readonly
            mov w3, 0                   // fourth arg: not used
            mov x8, 56                  // openat I/O request
            svc 0                       // call system function
            cmp w0, 0                   // error check
            b.ge open_ok
            …

…
Assembly code:

    buf_size = 32
    alloc = -(16 + buf_size) & -16
    dealloc = -alloc
    buf_s = 16

    main:   stp x29, x30, [sp, alloc]!
            mov x29, sp
            add x1, x29, buf_s          // 2nd arg (ptr to buf)
    top:    mov w0, 0                   // 1st arg (stdin)
            mov x2, buf_size            // 3rd arg (BUFSIZE)
            mov x8, 63                  // read I/O request
            svc 0                       // call sys function
            cmp x0, 0                   // compare n_read and 0
            b.le exit                   // exit loop if n_read is