Lec8-4471029-ISA (Tradeoff) PDF
Kazi Nazrul University
Dohyung Kim
Summary
These lecture notes cover ISA tradeoffs in computer architecture. The document discusses various aspects of instruction set architectures (ISAs), including complex versus simple instructions, semantic gaps, and the implementation considerations. Topics also include different addressing modes and examples like X86 and Alpha instruction formats.
Full Transcript
ISA: Tradeoff
471029: Introduction to Computer Architecture, 8th Lecture
Disclaimer: Slides are mainly based on the COD 5th edition textbook and also developed in part by Profs. Dohyung Kim @ KNU and the computer architecture courses @ KAIST and SKKU

Tradeoffs: Soul of Computer Architecture
• Computer architecture is the science and art of making the appropriate trade-offs to meet a design point
• Soul of engineering in general
  - e.g., programming languages
    - Execution speed vs. ease of use
    - Low run-time costs vs. ease of debugging
    - Smaller code or loop unrolling
    - Compressed or uncompressed data (a space-time trade-off)
  - e.g., OS
    - System and user-level tradeoffs
• We’ll look into different tradeoff issues in computer architecture
  - ISA-level tradeoffs
  - Microarchitecture-level tradeoffs

Complex vs. Simple Instructions
• Complex instruction: an instruction does a lot of work
  - Multiple operations
  - e.g.,
    - Insert in a doubly linked list
    - Compute cos(x), Fast Fourier Transform
    - String copy
• Simple instruction: an instruction does a small amount of work; it is a primitive from which complex operations can be built
  - e.g.,
    - Add
    - Xor
    - Shift

ISA-level Tradeoffs: Semantic Gap
• Where to place the ISA? Semantic gap
  - Closer to high-level language (HLL) or closer to hardware control signals? → Complex vs. simple instructions
  - RISC vs. CISC vs. HLL machines
    - FFT, QUICKSORT, POLY, FP instructions?
    - VAX INDEX instruction (array access with bounds checking)
• How high or low can you go?
  - Very large semantic gap
    - Each instruction specifies the complete set of control signals in the machine
    - Compiler generates control signals
    - Open microcode (John Cocke, circa 1970s)
      - Gave way to optimizing compilers
  - Very small semantic gap
    - ISA is (almost) the same as the high-level language
    - Java machines, LISP machines, object-oriented machines, capability-based machines

ISA-level Tradeoffs: Semantic Gap
• Where to place the ISA? Semantic gap (cont’d)
  - Tradeoffs:
    - Simple compiler, complex hardware vs. complex compiler, simple hardware
      - Caveat: translation (indirection) can change the tradeoff!
    - Burden of backward compatibility
    - Performance?
      - Optimization opportunity: in the VAX INDEX instruction example, who (compiler vs. hardware) puts more effort into optimization?
      - Instruction size, code size

X86: Small Semantic Gap: String Operations
• An instruction operates on a string
  - Move one string of arbitrary length to another location (e.g., string copy)
  - Compare two strings
• Enabled by the ability to specify repeated execution of an instruction (in the ISA)
  - Using a “prefix” called the REP prefix
• Example: REP MOVS instruction
  - Only two bytes: the REP prefix byte and the MOVS opcode byte (F3 A4)
  - Implicit source and destination registers pointing to the two strings (ESI, EDI)
  - Implicit count register (ECX) specifies how long the string is

X86: Small Semantic Gap: String Operations
  REP MOVS DEST SRC
• How many instructions does this take in a RISC ISA (e.g., Alpha, MIPS, ARM, …)?

Small Semantic Gap Examples in VAX
• FIND FIRST
  - Find the first set bit in a bit field
  - Helps OS resource allocation operations
• SAVE CONTEXT, LOAD CONTEXT
  - Special context switching instructions
• INSQUEUE, REMQUEUE
  - Operations on a doubly linked list
• INDEX
  - Array access with bounds checking
• STRING operations
  - Compare strings, find substrings, …
• Cyclic Redundancy Check instruction
• Digital Equipment Corp., “VAX11 780 Architecture Handbook”, 1977-78

[ASIDE] F00F bugs on Intel Pentium
• As the implementation of ISA instructions (the microarchitecture) becomes more complicated, the attack surface extends to hardware (the microarchitecture).
  - e.g., cache side-channel attacks, Meltdown, Spectre, Intel FDIV, AMD FMA3
• One example of a paradigm shift (S/W → H/W).
• F00F instruction bug on Intel Pentium (discovered in 1997)
  - ‘cmpxchg8b m64’
    - Compare EDX:EAX with m64 (data in memory)
    - If equal, set ZF and load ECX:EBX into m64
    - Else, clear ZF and load m64 into EDX:EAX
    - 0F C7 → a two-byte CISC opcode, but it does a lot of work!
  - ‘cmpxchg8b eax’
    - Generates #UD (i.e., the destination operand is not a memory location)
  - ‘lock cmpxchg8b eax’
    - F0 0F C7 C8
    - LOCK prefixes are only allowed on memory-based read-modify-write instructions.
    - So, a LOCK prefix on the register-based ‘cmpxchg8b eax’ should generate an invalid opcode exception
    - But the Pentium locks up and freezes the entire computer when it encounters this instruction.
      - Imagine that an adversary executes this 4-byte instruction on a virtual machine in a cloud system

Small versus Large Semantic Gap
• CISC vs. RISC
  - Complex instruction set computer → complex instructions
    - Initially motivated by “not good enough” code generation
  - Reduced instruction set computer → simple instructions
    - John Cocke, mid 1970s, IBM 801
      - Goal: enable better compiler control and optimization
• RISC motivated by
  - Memory stalls (no work gets done in a complex instruction when there is a memory stall?)
    - When is this correct?
  - Simplifying the hardware → lower cost, higher frequency
  - Enabling the compiler to optimize the code better
    - Find fine-grained parallelism to reduce stalls

Small versus Large Semantic Gap (cont’d)
• Advantages of a small semantic gap (complex instructions)
  + Denser encoding → smaller code size → saves off-chip bandwidth, better cache hit rate (better packing of instructions)
  + Simpler compiler
• Disadvantages
  -- Larger chunks of work → compiler has less opportunity to optimize
  -- More complex hardware → translation to control signals and optimization need to be done by hardware

A Note on ISA Evolution
• ISAs have evolved to reflect/satisfy the concerns of the day
• Examples:
  - Limited on-chip and off-chip memory size
  - Limited compiler optimization technology
  - Limited memory bandwidth
  - Need for specialization in important applications (e.g., MMX)
• Use of translation (in HW and SW) enabled underlying implementations to be similar, regardless of the ISA
  - Concept of dynamic/static interface: translation/interpretation
  - Contrast it with the hardware/software interface

Effect of Translation
• One can translate from one ISA to another ISA to change the semantic gap tradeoffs
  - ISA (virtual ISA) → Implementation ISA
• Examples
  - Intel’s and AMD’s x86 implementations translate x86 instructions into programmer-invisible micro-operations (simple instructions) in hardware
  - Transmeta’s x86 implementations translated x86 instructions into “secret” VLIW (Very Long Instruction Word) instructions in software (code morphing software)
• Think about the tradeoffs

Hardware-Based Translation
[ Klaiber, “The Technology Behind Crusoe Processors,” Transmeta White Paper 2000. ]
• Microcode: a layer of HW-level instructions between the ISA and the CPU H/W
• The implementation depends heavily on the vendor (e.g., Intel, AMD)
[ Philipp Koppe, “Reverse Engineering x86 Processor Microcode”, USENIX Security 2017 ]

[Aside] Microcode bugs
• Examples
  - “Reverse Engineering x86 Processor Microcode”, USENIX Security 2017
    - Remote microcode attack
    - Cryptographic microcode Trojans
  - “Broken hyper-threading” in Intel Kaby Lake, 2017 (e.g., system crash)
  - RDRAND instruction in Ryzen 3000, 2019
    - It caused Ryzen 3000 users to never get any proper random numbers at all. Both problems caused lockups in Linux operating systems using systemd

Software-Based Translation
Klaiber, “The Technology Behind Crusoe Processors,” Transmeta White Paper 2000.

ISA-level Tradeoffs: Instruction Length
• Fixed length: all instructions have the same length
  + Easier to decode a single instruction in hardware
  + Easier to decode multiple instructions concurrently
  -- Wasted bits in instructions (Why is this bad?)
  -- Harder-to-extend ISA (how to add new instructions?)
• Variable length: instructions have different lengths (determined by opcode and sub-opcode)
  + Compact encoding (Why is this good?)
     Intel 432: Huffman encoding (sort of). 6 to 321 bit instructions
  -- More logic to decode a single instruction
  -- Harder to decode multiple instructions concurrently
• Tradeoffs
  - Code size (memory space, bandwidth, latency) vs. hardware complexity
  - ISA extensibility and expressiveness
  - Performance? Smaller code vs. imperfect decode

ISA-level Tradeoffs: Uniform Decode
• Uniform decode: the same bits in each instruction correspond to the same meaning
  - Opcode is always in the same location
  - Ditto operand specifiers, immediate values, …
  - Many “RISC” ISAs: Alpha, MIPS, SPARC
  + Easier decode, simpler hardware
  + Enables parallelism: generate the target address before knowing the instruction is a branch
  -- Restricts instruction format (fewer instructions?) or wastes space
• Non-uniform decode
  - E.g., the opcode can be the 1st to 3rd bytes in x86
  + More compact and powerful instruction format
  -- More complex decode logic

x86 vs. Alpha Instruction Formats
[Figure: x86 and Alpha instruction format diagrams]

A Note on Length and Uniformity
• Uniform decode usually goes with fixed length
• In a variable length ISA, uniform decode can be a property of instructions of the same length
  - It is hard to think of it as a property of instructions of different lengths

ISA-level Tradeoffs: Number of Registers
• Affects:
  - Number of bits used for encoding register address
  - Number of values kept in fast storage (register file)
  - (uarch) Size, access time, power consumption of register file
• Large number of registers:
  + Enables better register allocation (and optimizations) by compiler → fewer saves/restores
  -- Larger instruction size
  -- Larger register file size

ISA-level Tradeoffs: Addressing Modes
• Addressing mode specifies how to obtain an operand of an instruction
  - Register
  - Immediate
  - Memory (displacement, register indirect, indexed, absolute, memory indirect, autoincrement, autodecrement, …)
• More modes:
  + help better support programming constructs (arrays, pointer-based access)
  -- make it harder for the architect to design
  -- too many choices for the compiler?
    - Many ways to do the same thing complicate compiler design

x86 vs. Alpha Instruction Formats
[Figure: x86 and Alpha instruction format diagrams]

x86
[Figure labels: register indirect; memory absolute; SIB + displacement; register + displacement; register]

x86
[Figure labels: indexed: (base + index); scaled: (base + index*4)]

X86 SIB-D Addressing Mode
[Figure]

X86 Datasheet: Suggested Uses of Addressing Modes
[Figure labels: static address; dynamic storage; arrays; records]

X86 Datasheet: Suggested Uses of Addressing Modes
[Figure labels: static arrays w/ fixed-size elements; 2D arrays, structure; 2D arrays]

A Note on RISC vs. CISC
• Usually,
• RISC
  - Simple instructions
  - Fixed length
  - Uniform decode
  - Few addressing modes
• CISC
  - Complex instructions
  - Variable length
  - Non-uniform decode
  - Many addressing modes

Other Example ISA-level Tradeoffs
• Condition codes vs. not
• VLIW vs. single instruction
• Precise vs. imprecise exceptions
• Virtual memory vs. not
• Unaligned access vs. not
• Hardware interlocks vs. software-guaranteed interlocking
• Software vs. hardware managed page fault handling
• Cache coherence (hardware vs. software)
• …

Back to Programmer vs. (Micro)architect
• Many ISA features are designed to aid programmers
• But they complicate the hardware designer’s job
• Virtual memory
  - vs. overlay programming
  - Should the programmer be concerned about the size of code blocks fitting physical memory?
• Addressing modes
• Unaligned memory access
  - Otherwise the compiler/programmer needs to align data

MIPS: Aligned Access
  MSB | byte-3  byte-2  byte-1  byte-0 | LSB
      | byte-7  byte-6  byte-5  byte-4 |
• LW/SW alignment restriction: 4-byte word alignment
  - Not designed to fetch memory bytes not within a word boundary
  - Not designed to rotate unaligned bytes into registers
• Provide separate opcodes for the “infrequent” case
    rd initially:       A       B       C       D
    LWL rd, 6(r0)  →  byte-6  byte-5  byte-4  D
    LWR rd, 3(r0)  →  byte-6  byte-5  byte-4  byte-3
  - LWL/LWR is slower
  - Note that LWL and LWR still fetch within a word boundary

X86: Unaligned Access
• LD/ST instructions automatically align data that spans a “word” boundary
• Programmer/compiler does not need to worry about where data is stored (whether or not in a word-aligned location)

X86: Unaligned Access (cont’d)
[Figure]
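
The LWL/LWR example on the "MIPS: Aligned Access" slide can be traced in software. The sketch below is a minimal model and not the hardware's actual implementation: it assumes a little-endian byte order (matching the byte-0-at-LSB layout on the slide), and the helper names `aligned_word`, `lwl`, `lwr`, and `load_unaligned` are hypothetical names introduced only for illustration. Each helper reads exactly one aligned word, which is the point of the slide's note that LWL and LWR still fetch within a word boundary.

```python
MASK32 = 0xFFFFFFFF  # registers are modeled as 32-bit unsigned values


def aligned_word(mem: bytes, addr: int) -> int:
    """Read the aligned 4-byte word containing addr (little-endian assumed)."""
    base = addr & ~3
    return int.from_bytes(mem[base:base + 4], "little")


def lwl(rd: int, mem: bytes, addr: int) -> int:
    """Model of LWL: fill the upper bytes of rd from the word containing addr."""
    shift = 8 * (3 - (addr & 3))
    keep = (1 << shift) - 1            # low bytes of rd that stay untouched
    return ((aligned_word(mem, addr) << shift) & MASK32) | (rd & keep)


def lwr(rd: int, mem: bytes, addr: int) -> int:
    """Model of LWR: fill the lower bytes of rd from the word containing addr."""
    shift = 8 * (addr & 3)
    keep = ~(MASK32 >> shift) & MASK32  # high bytes of rd that stay untouched
    return (aligned_word(mem, addr) >> shift) | (rd & keep)


def load_unaligned(mem: bytes, addr: int) -> int:
    """Unaligned 32-bit load built from two aligned accesses (LWL then LWR)."""
    rd = lwl(0, mem, addr + 3)  # most significant byte lives at addr + 3
    return lwr(rd, mem, addr)   # least significant byte lives at addr


# byte-0 .. byte-7 hold 0x10 .. 0x17 so each byte is easy to recognize
MEM = bytes(range(0x10, 0x18))

if __name__ == "__main__":
    rd = lwl(0xAABBCCDD, MEM, 6)   # byte-6 byte-5 byte-4 D
    print(hex(rd))
    rd = lwr(rd, MEM, 3)           # byte-6 byte-5 byte-4 byte-3
    print(hex(rd))
```

On x86 the same access is a single load instruction, so the two-step merge above is work that the MIPS compiler or programmer takes on in exchange for simpler load hardware.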