Lec8-4471029-ISA (Tradeoff) PDF

ISA: Tradeoff
471029: Introduction to Computer Architecture
8th Lecture
Disclaimer: Slides are mainly based on the COD 5th edition textbook and were also developed in part by Profs. Dohyung Kim @ KNU and the computer architecture courses @ KAIST and SKKU

Tradeoffs: Soul of Computer Architecture
• Computer architecture is the science and art of making the appropriate trade-offs to meet a design point
• Soul of engineering in general
  - e.g., programming languages
    - Execution speed vs. ease of use
    - Low run-time costs vs. ease of debugging
    - Smaller code or loop unrolling
    - Compressed or uncompressed data (a space-time trade-off)
  - e.g., OS
    - System-level and user-level tradeoffs
• We’ll look into different tradeoff issues in computer architecture
  - ISA-level tradeoffs
  - Microarchitecture-level tradeoffs

Complex vs. Simple Instructions
• Complex instruction: an instruction does a lot of work
  - Multiple operations
  - e.g.,
    - Insert in a doubly linked list
    - Compute cos(x), Fast Fourier Transform
    - String copy
• Simple instruction: an instruction does a small amount of work; it is a primitive from which complex operations can be built
  - e.g.,
    - Add
    - Xor
    - Shift

ISA-level Tradeoffs: Semantic Gap
• Where to place the ISA? Semantic gap
  - Closer to high-level language (HLL) or closer to hardware control signals? → Complex vs. simple instructions
  - RISC vs. CISC vs. HLL machines
    - FFT, QUICKSORT, POLY, FP instructions?
    - VAX INDEX instruction (array access with bounds checking)
• How high or low can you go?
  - Very large semantic gap
    - Each instruction specifies the complete set of control signals in the machine
    - Compiler generates control signals
    - Open microcode (John Cocke, circa 1970s)
      - Gave way to optimizing compilers
  - Very small semantic gap
    - ISA is (almost) the same as the high-level language
    - Java machines, LISP machines, object-oriented machines, capability-based machines

ISA-level Tradeoffs: Semantic Gap
• Where to place the ISA?
  Semantic gap (cont’d)
  - Tradeoffs:
    - Simple compiler, complex hardware vs. complex compiler, simple hardware
      - Caveat: Translation (indirection) can change the tradeoff!
    - Burden of backward compatibility
    - Performance?
      - Optimization opportunity: in the example of the VAX INDEX instruction, who (compiler vs. hardware) puts more effort into optimization?
      - Instruction size, code size

X86: Small Semantic Gap: String Operations
• An instruction operates on a string
  - Move one string of arbitrary length to another location (e.g., string copy)
  - Compare two strings
• Enabled by the ability to specify repeated execution of an instruction (in the ISA)
  - Using a “prefix” called the REP prefix
• Example: REP MOVS instruction
  - Only two bytes: REP prefix byte and MOVS opcode byte (F3 A4)
  - Implicit source and destination registers pointing to the two strings (ESI, EDI)
  - Implicit count register (ECX) specifies how long the string is

X86: Small Semantic Gap: String Operations
REP MOVS DEST SRC
How many instructions does this take in RISC (e.g., Alpha, MIPS, ARM, …)?

Small Semantic Gap Examples in VAX
• FIND FIRST
  - Find the first set bit in a bit field
  - Helps OS resource allocation operations
• SAVE CONTEXT, LOAD CONTEXT
  - Special context switching instructions
• INSQUEUE, REMQUEUE
  - Operations on a doubly linked list
• INDEX
  - Array access with bounds checking
• STRING operations
  - Compare strings, find substrings, …
• Cyclic Redundancy Check instruction
• Digital Equipment Corp. “VAX11 780 Architecture Handbook”, 1977-78

[ASIDE] F00F bugs on Intel Pentium
• As the implementation of ISA instructions (or the microarchitecture) becomes more complicated, the attack surface extends to the H/W (or microarchitecture).
  - e.g., cache side-channel attacks, Meltdown, Spectre, Intel FDIV, AMD FMA3
• One example of a paradigm shift (S/W → H/W).
• F00F instruction bug on Intel Pentium (discovered in 1997)
  - ‘cmpxchg8b m64’
    - Compare EDX:EAX with m64 (data in memory)
    - If equal, set ZF and load ECX:EBX into m64
    - Else, clear ZF and load m64 into EDX:EAX
    - 0F C7 → a two-byte CISC opcode, but it does a lot of work!
  - ‘cmpxchg8b eax’
    - Generates #UD (i.e., the destination operand is not a memory location)
  - ‘lock cmpxchg8b eax’
    - F0 0F C7 C8
    - LOCK prefixes are only allowed on memory-based read-modify-write instructions.
    - So, a LOCK prefix on the register-based ‘cmpxchg8b eax’ should generate an invalid opcode exception.
    - But the Pentium locks up and freezes the entire computer when it encounters this instruction.
      - Imagine that an adversary executes this 4-byte instruction on a virtual machine in a cloud system

Small versus Large Semantic Gap
• CISC vs. RISC
  - Complex instruction set computer → complex instructions
    - Initially motivated by “not good enough” code generation
  - Reduced instruction set computer → simple instructions
    - John Cocke, mid 1970s, IBM 801
      - Goal: enable better compiler control and optimization
• RISC motivated by
  - Memory stalls (no work done in a complex instruction when there is a memory stall?)
    - When is this correct?
  - Simplifying the hardware → lower cost, higher frequency
  - Enabling the compiler to optimize the code better
    - Find fine-grained parallelism to reduce stalls

Small versus Large Semantic Gap (cont’d)
• Advantages of small semantic gap (complex instructions)
  + Denser encoding → smaller code size → saves off-chip bandwidth, better cache hit rate (better packing of instructions)
  + Simpler compiler
• Disadvantages
  -- Larger chunks of work → compiler has less opportunity to optimize
  -- More complex hardware → translation to control signals and optimization need to be done by hardware

A Note on ISA Evolution
• ISAs have evolved to reflect/satisfy the concerns of the day
• Examples:
  - Limited on-chip and off-chip memory size
  - Limited compiler optimization technology
  - Limited memory bandwidth
  - Need for specialization in important applications (e.g., MMX)
• Use of translation (in HW and SW) enabled underlying implementations to be similar, regardless of the ISA
  - Concept of dynamic/static interface: translation/interpretation
  - Contrast it with the hardware/software interface

Effect of Translation
• One can translate from one ISA to another ISA to change the semantic gap tradeoffs
  - ISA (virtual ISA) → Implementation ISA
• Examples
  - Intel’s and AMD’s x86 implementations translate x86 instructions into programmer-invisible micro-operations (simple instructions) in hardware
  - Transmeta’s x86 implementations translated x86 instructions into “secret” VLIW (Very Long Instruction Word) instructions in software (code morphing software)
• Think about the tradeoffs

Hardware-Based Translation
[ Klaiber, “The Technology Behind Crusoe Processors,” Transmeta White Paper 2000. ]
• Microcode: a layer of HW-level instructions between the ISA and the CPU H/W
• The implementation depends heavily on the vendor (e.g., Intel, AMD)
[ Philipp Koppe, “Reverse Engineering x86 Processor Microcode”, USENIX Security 2017 ]

[Aside] Microcode bugs
• Examples
  - “Reverse Engineering x86 Processor Microcode”, USENIX Security 2017
    - Remote microcode attack
    - Cryptographic microcode Trojans
  - “Broken hyper-threading” in Intel Kaby Lake, 2017 (e.g., system crash)
  - RDRAND instruction in Ryzen 3000, 2019
    - It caused Ryzen 3000 users to never get any proper random numbers at all. Both problems caused lockups in Linux operating systems using systemd

Software-Based Translation
Klaiber, “The Technology Behind Crusoe Processors,” Transmeta White Paper 2000.

ISA-level Tradeoffs: Instruction Length
• Fixed length: length of all instructions is the same
  + Easier to decode a single instruction in hardware
  + Easier to decode multiple instructions concurrently
  -- Wasted bits in instructions (Why is this bad?)
  -- Harder-to-extend ISA (how to add new instructions?)
• Variable length: length of instructions differs (determined by opcode and sub-opcode)
  + Compact encoding (Why is this good?)
    Intel 432: Huffman encoding (sort of). 6 to 321 bit instructions
  -- More logic to decode a single instruction
  -- Harder to decode multiple instructions concurrently
• Tradeoffs
  - Code size (memory space, bandwidth, latency) vs. hardware complexity
  - ISA extensibility and expressiveness
  - Performance? Smaller code vs. imperfect decode

ISA-level Tradeoffs: Uniform Decode
• Uniform decode: the same bits in each instruction correspond to the same meaning
  - Opcode is always in the same location
  - Ditto operand specifiers, immediate values, …
  - Many “RISC” ISAs: Alpha, MIPS, SPARC
  + Easier decode, simpler hardware
  + Enables parallelism: generate target address before knowing the instruction is a branch
  -- Restricts instruction format (fewer instructions?) or wastes space
• Non-uniform decode
  - E.g., opcode can be the 1st-3rd bytes in x86
  + More compact and powerful instruction format
  -- More complex decode logic

x86 vs. Alpha Instruction Formats
• x86: (figure)
• Alpha: (figure)

A Note on Length and Uniformity
• Uniform decode usually goes with fixed length
• In a variable length ISA, uniform decode can be a property of instructions of the same length
  - It is hard to think of it as a property of instructions of different lengths

ISA-level Tradeoffs: Number of Registers
• Affects:
  - Number of bits used for encoding register address
  - Number of values kept in fast storage (register file)
  - (uarch) Size, access time, power consumption of register file
• Large number of registers:
  + Enables better register allocation (and optimizations) by compiler → fewer saves/restores
  -- Larger instruction size
  -- Larger register file size

ISA-level Tradeoffs: Addressing Modes
• Addressing mode specifies how to obtain an operand of an instruction
  - Register
  - Immediate
  - Memory (displacement, register indirect, indexed, absolute, memory indirect, autoincrement, autodecrement, …)
• More modes:
  + Help better support programming constructs (arrays, pointer-based access)
  -- Make it harder for the architect to design
  -- Too many choices for the compiler?
    - Many ways to do the same thing complicates compiler design

x86 vs. Alpha Instruction Formats
• x86: (figure)
• Alpha: (figure)

x86
(figure labels, in source order: register indirect, Memory, absolute, SIB + displacement, register + displacement, register, Register)

x86
• indexed: (base + index)
• scaled: (base + index*4)

X86 SIB-D Addressing Mode
(figure)

X86 Datasheet: Suggested Uses of Addressing Modes
(figure labels: Static address, Dynamic storage, Arrays, Records)

X86 Datasheet: Suggested Uses of Addressing Modes
(figure labels: Static arrays w/ fixed-size elements 2D arrays, Structure 2D arrays)

A Note on RISC vs. CISC
• Usually,
• RISC
  - Simple instructions
  - Fixed length
  - Uniform decode
  - Few addressing modes
• CISC
  - Complex instructions
  - Variable length
  - Non-uniform decode
  - Many addressing modes

Other Example ISA-level Tradeoffs
• Condition codes vs. not
• VLIW vs. single instruction
• Precise vs. imprecise exceptions
• Virtual memory vs. not
• Unaligned access vs. not
• Hardware interlocks vs. software-guaranteed interlocking
• Software vs. hardware managed page fault handling
• Cache coherence (hardware vs. software)
• …

Back to Programmer vs. (Micro)architect
• Many ISA features are designed to aid programmers
• But they complicate the hardware designer’s job
• Virtual memory
  - vs. overlay programming
  - Should the programmer be concerned about the size of code blocks fitting physical memory?
• Addressing modes
• Unaligned memory access
  - Compiler/programmer needs to align data

MIPS: Aligned Access
MSB byte-3 byte-2 byte-1 byte-0 LSB
    byte-7 byte-6 byte-5 byte-4
• LW/SW alignment restriction: 4-byte word alignment
  - Not designed to fetch memory bytes not within a word boundary
  - Not designed to rotate unaligned bytes into registers
• Provide separate opcodes for the “infrequent” case
    A B C D
    LWL rd 6(r0) → byte-6 byte-5 byte-4 D
    LWR rd 3(r0) → byte-6 byte-5 byte-4 byte-3
  - LWL/LWR is slower
  - Note LWL and LWR still fetch within a word boundary

X86: Unaligned Access
• LD/ST instructions automatically align data that spans a “word” boundary
• Programmer/compiler does not need to worry about where data is stored (whether or not in a word-aligned location)

X86: Unaligned Access (cont’d)
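
A minimal Python sketch of the aligned vs. unaligned access tradeoff from the last slides. It assumes the little-endian byte layout of the MIPS diagram (byte-0 is the LSB of the first word). The names load_aligned_word and load_unaligned_word are hypothetical. The code only illustrates how a word that straddles a word boundary can be assembled from two aligned fetches; it does not model the exact LWL/LWR semantics or any real x86 implementation.

```python
def load_aligned_word(mem: bytes, addr: int) -> int:
    """Model of an aligned 4-byte load (like LW): addr must be a multiple of 4."""
    if addr % 4 != 0:
        raise ValueError("address is not word-aligned")
    # Little-endian: the byte at the lowest address is the least significant.
    return int.from_bytes(mem[addr:addr + 4], "little")


def load_unaligned_word(mem: bytes, addr: int) -> int:
    """Assemble a 32-bit value at any byte address from at most two aligned loads."""
    offset = addr % 4
    base = addr - offset
    low_word = load_aligned_word(mem, base)
    if offset == 0:
        return low_word  # already aligned: one fetch is enough
    high_word = load_aligned_word(mem, base + 4)
    shift = 8 * offset
    # Upper bytes of the first word become the low bytes of the result;
    # lower bytes of the second word become the high bytes of the result.
    return ((low_word >> shift) | (high_word << (32 - shift))) & 0xFFFFFFFF


# Memory holding byte-0 .. byte-7, as in the MIPS diagram.
memory = bytes(range(8))
# Word starting at byte-3 consists of byte-6 byte-5 byte-4 byte-3 (MSB to LSB).
value = load_unaligned_word(memory, 3)
```

Here value equals 0x06050403, the same register contents the LWL/LWR pair on the MIPS slide ends up with. Whether this merging is done by two explicit instructions (MIPS) or hidden inside a single load (x86) is the programmer vs. hardware-designer tradeoff the slides describe.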
