Questions and Answers
Parallel computing involves only bit-level parallelism.
False
Parallelism's acceleration of computing speeds has only become apparent in recent years.
False
Developing parallel hardware and software has consistently been less time-intensive than developing serial hardware and software.
False
Moore's Law indicates that the number of transistors on integrated circuits doubles precisely every year.
According to Moore's Law, the number of components per integrated circuit for minimum cost was predicted to be about 65,000 by 1975.
DRAM access times have consistently improved at a faster rate than high-end processor speeds over the past decade.
DRAM stands for 'diminished random access memory'.
In applications with large data volumes, analyses must always be performed on the network directly, rather than moving the data.
Parallel computing exclusively focuses on cost reduction.
Parallel computing applications in engineering include optimization of airfoil lift, drag, and stability.
Parallel computing has absolutely no applications in the field of astrophysics.
Parallel computers are currently not used in financial sectors like Wall Street.
Embedded systems are moving away from the use of distributed control algorithms.
The scope of parallelism only addresses the processor component of conventional architectures.
Server applications primarily use aggregate throughput instead of aggregate network bandwidth.
Microprocessor clock speeds have not significantly changed in recent decades.
Processors are solely classified as microprocessors.
The primary function of a transistor is to store data.
Pipelining fully eliminates instruction execution time.
In pipelining, an instruction can be executed while the next one is being decoded and the next one is being fetched.
Conventional processors now use shallow pipelines with few stages.
The penalty of a pipeline misbehavior decreases with the pipeline's depth.
Superscalar execution involves performing parts of a single instruction at the same time.
Resource Dependency in superscalar execution means that one operation's result is an input to the next.
Branch Dependency in superscalar execution can always be predicted a-priori.
In superscalar processors, the scheduler looks at a small number of instructions.
In-order issue in superscalar execution means instructions are issued regardless of data dependencies.
Vertical waste in superscalar execution occurs when some functional units are utilized during a cycle.
VLIW processors rely on runtime analysis.
Latency is the rate at which data can be pumped to the processor by the memory system.
Low latency always means high bandwidth.
Memory bandwidth is independent of the memory bus.
Increasing the block size of words always reduces the latency of the system.
When data is accessed in a single memory, it fetches two consecutive words into the vector.
In column-major data access, the matrix is traversed in a row order thus resulting in more efficient performance.
Processing units in parallel computers can only operate under the centralized control of a single processor.
In SIMD, each processor has its own control unit executing different instructions.
RISC computers use complex instructions.
In MIMD, multiple processors asynchronously issuing instructions
In static networks, communication links are connected to one another dynamically.
Flashcards
Parallel Computing
A type of computation where multiple calculations or processes are carried out jointly, dividing large problems into smaller ones to be solved simultaneously.
Bit-level Parallelism
A type of parallelism that focuses on manipulating data at the bit level to perform multiple operations simultaneously.
Instruction-level Parallelism
A type of parallelism that allows multiple instructions from a program to be executed simultaneously, improving the execution speed.
Data Parallelism
Task Parallelism
Motivation for Parallelism
Moore's Law
Memory/Disk Speed Argument
Latency
Bandwidth
Dynamic Random Access Memory (DRAM)
Data Communication Argument
Goal of applications of parallelism
Use of parallelism in High-Speed Circuit design
Scientific Applications of Parallel Computing
Parallelism and Large Scale Servers
Parallelism in Computer Systems
Scope of Parallelism
Microprocessor Gains
Processor Resource Use
Pipelining
Scheduling branch dependency
VLIW
Improving Memory Latency
Role of Caches
Memory bandwidth determination
Multicore Processors
SIMD
MIMD
Supercomputer
Shared Memory
Distributed Memory
SISD
Multiple Instruction Stream-Multiple Data Stream (MIMD)
SPMD
Message Passing Architecture
Dynamic Network
Bisection Width
Primary Forms of data exchange
Startup Time
Study Notes
Introduction to Parallel Computing
- Parallel computing involves multiple processes or calculations carried out simultaneously.
- This approach is used to divide large problems into smaller, solvable parts, enabling faster solutions.
- Different categories of parallel computing exist, including bit-level, instruction-level, data, and task parallelism.
Topic Overview
- Motivating factors for using parallelism.
- The scope of applications where parallel computing can be applied.
Motivation for Parallelism
- Parallelism has long been recognized for its ability to speed up computations.
- Commercial applications benefit from the multiplicity of data paths and increased access to storage elements provided by parallelism.
- Parallel platforms offer scalable performance and lower costs, making them suitable for a wide range of applications.
- Initially, creating parallel hardware and software was time-consuming and demanding.
- Despite improving uniprocessor speeds, hardware design trends suggest uniprocessor architectures may not sustain performance increases due to physical and computational limits.
- Standardized parallel programming environments, libraries, and hardware have made parallel solutions faster to implement.
The Computational Power Argument
- Moore's Law: The number of transistors on integrated circuits doubles about every two years, leading to increased processing power and better performance.
- Serial processors rely heavily on implicit parallelism.
The Memory/Disk Speed Argument
- Processor speeds have risen roughly 40% annually over the last decade, while DRAM access times have improved only about 10% annually, creating a performance bottleneck.
- Parallel platforms offer increased bandwidth to the memory system.
- DRAM (dynamic random access memory) is the memory technology used in main memory modules and graphics cards; it stores each bit of data in a capacitor within an integrated circuit.
- Parallel platforms provide higher aggregate caches.
- Principles of parallel algorithm design apply to memory optimization.
The Data Communication Argument
- The Internet is evolving into a large computing platform.
- Many applications, especially in databases and data mining, involve data volumes too large to move.
- Data analysis must occur over the network using parallel techniques.
Scope of Parallel Computing Applications
- Parallelism addresses a wide range of applications across diverse domains, driven by various motivations.
- Across these application domains, the primary goal is achieving a better performance-to-cost ratio.
Applications in Engineering and Design
- Airfoil design optimizes lift, drag, and stability.
- Internal combustion engine design optimizes charge distribution and burn efficiency.
- High-speed circuit design optimizes layouts for delays and capacitive/inductive effects.
- Structural design optimizes structural integrity, design parameters, and cost.
- Micro- and nano-scale systems facilitate design and simulation.
- Process optimization and operations research are key application areas.
Scientific Applications
- Functional and structural characterization of genes and proteins.
- Advances in computational physics and chemistry explore new materials, understand chemical pathways, and create more efficient processes.
- Astrophysics applications study galaxy evolution, thermonuclear processes, and analyze extensive telescope datasets.
- Weather modeling, mineral prospecting, and flood prediction are important scientific applications.
- Bioinformatics and astrophysics present challenges in analyzing extremely large datasets.
Commercial Applications
- Parallel computers play a vital role in powering Wall Street.
- Data mining and analysis are used to optimize business and marketing decisions.
- Large-scale servers, such as mail and web servers, often use parallel platforms.
- Information retrieval and search applications are powered by large clusters using parallelism.
Applications in Computer Systems
- Core uses of parallel computing techniques include network intrusion detection, cryptography, and multiparty computations.
- Embedded systems increasingly rely on distributed control algorithms.
- Modern automobiles use many processors to optimize handling and performance.
- Peer-to-peer networks use overlay networks and algorithms from parallel computing.
Scope of Parallelism
- Conventional architectures have a processor, memory system, and datapath.
- Performance bottlenecks exist in each component.
- Parallelism addresses each component significantly.
- Different applications stress different aspects of parallelism: data-intensive applications require high aggregate throughput, server applications require high aggregate network bandwidth, and scientific applications require high processing and memory system performance.
Implicit Parallelism: Trends in Microprocessor Architectures
- Microprocessor clock speeds have significantly increased over the past two decades.
- Different types of processors include microprocessors, microcontrollers, embedded processors, and digital signal processors.
- CPUs are classified as single-core, dual-core, quad-core, hexa-core, octa-core, and deca-core processors.
- Current processors execute multiple instructions in the same cycle, using multiple functional units due to higher device integration and more transistors.
- A transistor's primary function is to amplify or switch electric current in a circuit, not to store data.
Pipelining and Superscalar Execution
- Pipelining improves performance by overlapping instruction execution stages.
- Instructions can be executed, decoded, and fetched simultaneously.
- This process resembles an assembly line in manufacturing.
- Pipelining has its limits: in typical instruction traces, roughly every fifth or sixth instruction is a conditional jump.
- A mispredicted conditional jump requires the pipeline to be flushed, and this penalty grows with the depth of the pipeline.
- Using multiple pipelines (superscalar execution) helps alleviate this bottleneck.
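- A rough back-of-the-envelope estimate of this penalty (the numbers below are illustrative assumptions, not taken from the notes): if one instruction in five is a conditional jump, a fraction p of the jumps is mispredicted, and each misprediction flushes a k-stage pipeline, then

```latex
\text{cycles per instruction} \;\approx\; 1 + \frac{p\,k}{5},
\qquad\text{e.g. } k = 20,\; p = 0.1 \;\Rightarrow\; 1 + \frac{0.1 \times 20}{5} = 1.4
```

- Even a modest misprediction rate therefore erodes a large share of the ideal one-instruction-per-cycle throughput, and more so for deeper pipelines.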
Superscalar Execution
- Instruction scheduling is determined by true data dependency, resource dependency, and branch dependency.
- The scheduler looks at instructions in a queue and determines how many can be executed concurrently based on several factors.
- The complexity of hardware is an important constraint on superscalar processors.
Superscalar Execution: Issue Mechanisms
- Instructions are issued in the order encountered; this is known as in-order issue.
- If an instruction cannot be issued because of a data dependency, only one instruction is issued in that cycle.
- Instructions can also be issued out of order; this is known as dynamic issue.
- With dynamic issue, if the second instruction has a data dependency on the first, an independent third instruction can be co-scheduled with the first.
- In-order issue is therefore generally limited in its performance.
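- A minimal C sketch of the scenario above (the names and operations are made up for illustration; assume each statement compiles to roughly one floating-point instruction):

```c
/* Illustrative only: how a 2-way superscalar issue mechanism might treat
 * three operations (names are hypothetical, not from the notes). */
double issue_example(double x, double y, double z, double u, double v) {
    double a = x * y;   /* (1)                                              */
    double b = a + z;   /* (2) true data dependency on (1): needs a         */
    double c = u * v;   /* (3) independent of (1) and (2)                   */

    /* In-order issue:  (1) issues alone, since (2) cannot issue with it;
     *                  (3) waits its turn even though it is independent.
     * Dynamic issue:   (1) and (3) can be co-scheduled in one cycle,
     *                  and (2) issues once the result of (1) is ready.     */
    return b + c;
}
```

- The hardware, not the programmer, discovers this instruction-level parallelism, which is why superscalar parallelism is described as implicit.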
Superscalar Execution: Efficiency Considerations
- Functional units cannot be kept active continually.
- Vertical waste occurs when no functional units are utilized during a cycle.
- Horizontal waste occurs when only some of the functional units are utilized during a cycle.
- Performance of superscalar processors is limited by parallelism in instruction traces, or when the scheduler can't extract parallelism.
Very Long Instruction Word (VLIW) Processors
- Hardware cost and complexity of the superscalar scheduler is a major consideration in processor design.
- VLIW processors analyze code at compile time to bundle instructions for concurrent execution.
- The core concept was used in the Multiflow Trace machine around 1984.
- Variants of this concept can be found in Intel IA64 processors.
- Typical VLIW processors allow 4-way to 8-way parallelism.
Limitations of Memory System Performance
- Memory system performance, not processor speed, often limits application performance.
- Memory system performance is gauged by latency and bandwidth.
- Memory latency: the time it takes for the memory system to return data once the processor requests it.
- Bandwidth: the rate at which data can be pumped to the processor by the memory system.
Memory System Performance: Bandwidth and Latency
- It is important to differentiate between latency and bandwidth.
- Firehose example: if it takes 2 seconds for water to start flowing once the hydrant is opened, the latency is 2 seconds.
- If the hydrant then delivers water at 5 gallons per second, the bandwidth is 5 gallons per second.
- If you want an immediate response, you care about latency.
- If you want to fight big fires, you care about bandwidth.
Memory Latency: An Example
- Consider a 1 GHz processor (1 ns clock), connected to DRAM with 100 ns latency (no caches).
- It has two multiply-add units and can execute four instructions in each 1 ns cycle.
- The peak processor rating is therefore 4 GFLOPS.
- With a memory latency of 100 ns, the processor must wait 100 cycles for every memory request before it can operate on the data.
Memory Latency: An Example (continued)
- Consider computing the dot-product of two vectors: each multiply-add operates on a single pair of vector elements, and every floating-point operation requires one data fetch.
- Peak computation speed is therefore limited to one floating-point operation every 100 ns, i.e., about 10 MFLOPS.
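- A minimal sketch of the dot-product loop in question (function and variable names are illustrative, not from the lecture):

```c
/* Dot-product of two n-element vectors. Each iteration performs one
 * multiply-add (two FLOPs) but needs two operands fetched from memory,
 * i.e. one data fetch per floating-point operation. With 100 ns memory
 * latency and no cache, every fetch stalls the processor, so throughput
 * is limited to roughly one FLOP per 100 ns regardless of peak rating. */
double dot_product(const double *a, const double *b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```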
Improving Effective Memory Latency Using Caches
- Caches are small, fast memory elements placed between the processor and DRAM.
- Caches provide low-latency, high-bandwidth storage.
- Caches reduce the effective memory latency for data that is accessed frequently.
- The cache hit ratio largely determines the resulting system performance.
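- A standard way to quantify this (a common textbook expression, not spelled out in the notes): with hit ratio h, cache latency t_cache, and DRAM latency t_DRAM, the effective access time is

```latex
t_{\text{eff}} = h\,t_{\text{cache}} + (1 - h)\,t_{\text{DRAM}},
\qquad\text{e.g. } h = 0.9,\; t_{\text{cache}} = 1\,\text{ns},\; t_{\text{DRAM}} = 100\,\text{ns}
\;\Rightarrow\; t_{\text{eff}} = 0.9(1) + 0.1(100) = 10.9\,\text{ns}.
```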
Impact of Caches: Example
- Consider the same 1 GHz processor (1 ns clock) connected to DRAM with 100 ns latency, with two multiply-add units executing four instructions per 1 ns cycle.
- A 32 KB cache with a 1 ns latency is introduced, and the example multiplies two 32 × 32 matrices, which together fit in the cache.
- Fetching the two matrices into the cache means fetching 2K words, which takes approximately 200 µs.
- Multiplying two n × n matrices takes 2n³ operations; for n = 32 this is 64K operations, which can be performed in 16K cycles (16 µs) at four instructions per cycle.
- The total time is the load/store operations plus the computation itself: 200 µs + 16 µs = 216 µs.
- This corresponds to a peak computation rate of 64K operations / 216 µs, or roughly 300 MFLOPS.
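- The same arithmetic written out as equations (using the figures above):

```latex
t_{\text{fetch}} \approx 2\text{K words} \times 100\,\text{ns/word} \approx 200\,\mu\text{s},
\qquad
t_{\text{compute}} = \frac{2n^{3}}{4}\ \text{cycles} = \frac{64\text{K}}{4} \times 1\,\text{ns} = 16\,\mu\text{s},
\qquad
\text{rate} = \frac{64\text{K FLOPs}}{216\,\mu\text{s}} \approx 300\ \text{MFLOPS}.
```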
Impact of Memory Bandwidth
- Memory bandwidth is determined by the bandwidth of the memory bus that connects the processor to memory, as well as by the memory units themselves.
- The memory bus is a set of wires (conductors) that connects electrical components, allowing data and instructions to move between main memory and the CPU or memory controller.
- Memory bandwidth can be improved by increasing the size of the memory blocks transferred per access.
- It can be modeled as the system taking l time units (the latency) to deliver b units of data, where b is the block size.
Impact of Memory Bandwidth: Example
- Suppose the block size is four words instead of one, the memory bus is 64 bits wide, the bus connecting the processor and memory runs at 800 MHz, and the vectors are laid out linearly in memory.
- The dot-product can then perform eight FLOPs (four multiply-adds) in 200 cycles.
Impact of Memory Bandwidth: Example (continued)
- Each memory access now fetches four consecutive words of a vector.
- This corresponds to one FLOP every 25 ns (eight FLOPs per 200 ns).
- Bus bandwidth: 64 bits × 800 MHz = 51,200 Mbit/s = 6,400 MB/s (divide by 8 to convert bits to bytes).
- 6400 / 200, or 32 MFLOPS.
Impact of Memory Bandwidth (continued)
- Buses transfer data between the computer's internal hardware components.
- Bus capacity is measured in bits; in general, the greater the bus capacity, the better the system performance.
- Increasing the block size does not change the latency of the system; it improves bandwidth, not latency.
- Higher bandwidth results in higher computation rates.
- These gains assume that consecutive data words in memory are used by successive operations, so data layout is key.
- Computations must often be reordered to enhance spatial locality of reference.
Impact of Memory Bandwidth: Example
- The vector column_sum is small and easily fits into the cache.
- The matrix b, however, is accessed in column order.
- This strided access results in very poor performance.
- Code rewritten to traverse the matrix row by row, accessing consecutive words in memory, can be expected to perform much better (see the sketch below).
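- A minimal sketch of the two traversal orders being described (the array size and function names are illustrative; this is a reconstruction, not the lecture's original code):

```c
#define N 1000

/* Column-by-column traversal: the inner loop reads b[j][i] with j varying,
 * so successive accesses are N words apart (strided) and each one touches
 * a different cache line, giving very poor performance. */
void column_sum_strided(double b[N][N], double column_sum[N]) {
    for (int i = 0; i < N; i++) {
        column_sum[i] = 0.0;
        for (int j = 0; j < N; j++)
            column_sum[i] += b[j][i];
    }
}

/* Row-by-row traversal of the same computation: for each fixed j the inner
 * loop walks consecutive words of row j of b, exploiting spatial locality. */
void column_sum_rowwise(double b[N][N], double column_sum[N]) {
    for (int i = 0; i < N; i++)
        column_sum[i] = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            column_sum[i] += b[j][i];
}
```

- Both functions compute the same column sums; only the order of memory accesses differs, which is exactly the spatial-locality effect described above.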
Memory System Performance: Summary
- Spatial and temporal locality are critical to amortizing memory latency and increasing bandwidth.
- The ratio of operations to memory accesses is a good indicator of how well a computation tolerates limited memory bandwidth.
- Memory layout and the organization of the computation strongly affect spatial and temporal locality.
Control Structure of Parallel Computing
- Modern processors use hyper-threading, by which each physical core behaves, in effect, as multiple logical cores (Device Manager may, for example, list 16 logical processors).
- Smartphones today use multicore processors of dual-core (two), quad-core (four) and octa-core (eight).
- Processing units in parallel computers either operate under the centralized control of a single control unit or work independently.
- Parallelism is expressed at various dispatch levels.
- If a single control unit dispatches instructions to the various processing units, the architecture is single instruction stream, multiple data stream (SIMD).
- In multiple instruction stream, multiple data stream (MIMD) architectures, each processing unit has its own control unit and can execute different instructions on different data.
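- As a rough illustration (not taken from the notes), the kind of data-parallel loop a SIMD machine or vector unit handles well applies the same operation to every element; the OpenMP `simd` pragma below is one common, compiler-dependent way to request this:

```c
#include <stddef.h>

/* SIMD-style computation: a single instruction stream applied element-wise
 * across the data. Each vector lane (or SIMD processing unit) performs the
 * same multiply-add on a different element of the arrays. */
void saxpy(float a, const float *x, float *y, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```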
Definitions of Computing
- A supercomputer is one of the fastest, most powerful computers available at a given time.
- SIMD (Single Instruction Stream, Multiple Data Stream): one instruction is issued at a time, each operating on an array of data.
- MIMD (Multiple Instruction Stream, Multiple Data Stream): multiple processors asynchronously issue instructions.
- Shared memory: a MIMD computer in which a common memory is accessible by all processors; UMA means uniform memory access by all processors, while NUMA means non-uniform memory access, depending on location/assignment.
- Distributed memory: a MIMD computer in which the memory is partitioned into processor-private regions.
- RISC (Reduced Instruction Set Computer): uses fast, simple instructions to do complex things.
- Pipelining: processing in assembly-line style.
Computing Taxonomy
- In 1966, Flynn proposed a taxonomy based on Instruction Streams and Data Streams:
- SISD: Single Instruction Stream, Single Data Stream – conventional sequential computer
- SIMD: Single Instruction Stream, Multiple Data Stream – control unit dispatches instructions to multiple processing units
- MISD: Multiple Instruction Stream, Single Data Stream – no widely used architectures of this type exist
- MIMD: Multiple Instruction Stream, Multiple Data Stream – each processing unit executes its own instructions under its own control unit
MIMD Architectures
- Within the MIMD classification, different program structures and address space organizations exist.
- Programming structures: Single Program Multiple Data (SPMD) and Multiple Program Multiple Data (MPMD).
- Address space organization: distributed memory architectures (message-passing multicomputers) and shared memory architectures (multiprocessor and multi-core).
SPMD
- SPMD, a subcategory of MIMD, is employed to achieve parallelism.
- A task is split so that the same program runs on multiple processors with different inputs, in order to obtain results more quickly.
- The copies run simultaneously and independently.
- SPMD is the most common style of parallel programming (see the sketch below).
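- A minimal SPMD sketch using MPI, a common message-passing library (the program is a generic illustration, not taken from the lecture): every process runs the same program and uses its rank to decide which part of the data to work on.

```c
#include <mpi.h>
#include <stdio.h>

/* SPMD: a single program launched as many processes (e.g. mpirun -np 4),
 * each working on its own slice of the problem based on its rank. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?       */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes total? */

    /* Each process handles a different chunk of 0..999 (illustrative). */
    int chunk = 1000 / size;
    int start = rank * chunk;
    int end   = (rank == size - 1) ? 1000 : start + chunk;

    long local_sum = 0;
    for (int i = start; i < end; i++)
        local_sum += i;

    long total = 0;
    MPI_Reduce(&local_sum, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum 0..999 = %ld\n", total);

    MPI_Finalize();
    return 0;
}
```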
MPMD
- Within the MIMD classification, MPMD refers to each processor having its own program to execute.
Message Passing Architectures
- Interactions between processes running on different nodes are accomplished through message passing.
- The exchange of messages allows transfer of data, work, and synchronization.
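- A hedged sketch of the basic exchange, again using MPI as the example message-passing interface (the value and tag are arbitrary): process 0 sends a value and process 1 receives it, which both transfers data and implicitly synchronizes the two processes. Run with at least two processes (e.g. mpirun -np 2).

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double work = 3.14;                     /* data to hand off          */
        MPI_Send(&work, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double work;
        MPI_Recv(&work, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);            /* blocks until data arrives */
        printf("rank 1 received %f\n", work);
    }

    MPI_Finalize();
    return 0;
}
```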
Interconnection Networks
- Static networks: point-to-point communication links between processors; also called direct networks.
- Dynamic networks: built using switches and communication links; also called indirect networks.
Criteria for Static Networks
- Diameter: the maximum shortest path between any two processors; it gives the worst-case latency.
- Bisection width: the minimum number of links that must be cut to partition the network into two halves; it indicates potential communication bottlenecks.
- Bisection bandwidth: bisection width times the bandwidth of the links.
Network Criteria
- Delay: Time taken for a packet to travel from the source to the destination.
- Total delay = transmission delay + propagation delay + queueing delay + processing delay.
- Network latency: the time delay between issuing a memory request and the moment the data becomes available at the processor.
- Communication latency: transmission time plus overhead.
- Bandwidth: the number of bits that can be transmitted per second.
- Connectivity: the number of paths between any two processors.
- Cost: the number of links in the network.
- Throughput: the actual transfer rate achieved in the network, accounting for data-transfer delays.
Communication Model of Parallel Platforms
- There are two primary forms of data exchange: a shared data space and message passing.
- Platforms that provide a shared data space are called shared-address-space machines or multiprocessors.
- Platforms that support message passing are called message-passing platforms or multicomputers.
- In shared-address-space architectures, the communication model is implicit, since some (or all) of the memory is accessible to all the processors.
Shared-Address-Space Platforms
- Part or all of the memory is accessible to all processors.
- Processors interact by modifying data objects stored in this shared address space.
- If the time taken by a processor to access any memory word (global or local) is identical, the platform is classified as uniform memory access (UMA); otherwise it is non-uniform memory access (NUMA).
NUMA and UMA Shared-Address-Space Platforms
- Multiprocessors can be categorized into three shared-memory models: Uniform Memory Access (UMA), Non-Uniform Memory Access (NUMA), and Cache-Only Memory Access (COMA).
- A single memory controller is used where access is uniform (UMA).
- In NUMA, different memory controllers are used, which makes access to nearby memory faster than access to remote memory.