Introduction to Parallel Computing


Questions and Answers

Parallel computing involves only bit-level parallelism.

False (B)

Parallelism's acceleration of computing speeds has only become apparent in recent years.

False (B)

Developing parallel hardware and software has consistently been less time-intensive than developing serial hardware and software.

False (B)

Moore's Law indicates that the number of transistors on integrated circuits doubles precisely every year.

False (B)

According to Moore's Law, the number of components per integrated circuit for minimum cost was predicted to be about 65,000 by 1975.

True (A)

DRAM access times have consistently improved at a faster rate than high-end processor speeds over the past decade.

False (B)

DRAM stands for 'diminished random access memory'.

False (B)

In applications with large data volumes, analyses must always be performed on the network directly, rather than moving the data.

True (A)

Parallel computing exclusively focuses on cost reduction.

False (B)

Parallel computing applications in engineering include optimization of airfoil lift, drag, and stability.

True (A)

Parallel computing has absolutely no applications in the field of astrophysics.

False (B)

Parallel computers are currently not used in financial sectors like Wall Street.

False (B)

Embedded systems are moving away from the use of distributed control algorithms.

False (B)

The scope of parallelism only addresses the processor component of conventional architectures.

False (B)

Server applications primarily use aggregate throughput instead of aggregate network bandwidth.

False (B)

Microprocessor clock speeds have not significantly changed in recent decades.

False (B)

Processors are solely classified as microprocessors.

False (B)

The primary function of a transistor is to store data.

False (B)

Pipelining fully eliminates instruction execution time.

False (B)

In pipelining, an instruction can be executed while the next one is being decoded and the next one is being fetched.

True (A)

Conventional processors now use shallow pipelines with few stages.

False (B)

The penalty of a pipeline misprediction decreases with the pipeline's depth.

False (B)

Superscalar execution involves performing parts of a single instruction at the same time.

False (B)

Resource Dependency in superscalar execution means that one operation's result is an input to the next.

False (B)

Branch Dependency in superscalar execution can always be predicted a-priori.

False (B)

In superscalar processors, the scheduler looks at a small number of instructions.

False (B)

In-order issue in superscalar execution means instructions are issued regardless of data dependencies.

False (B)

Vertical waste in superscalar execution occurs when some functional units are utilized during a cycle.

False (B)

VLIW processors rely on runtime analysis.

False (B)

Latency is the rate at which data can be pumped to the processor by the memory system.

False (B)

Low latency always means high bandwidth.

False (B)

Memory bandwidth is independent of the memory bus.

False (B)

Increasing the block size of words always reduces the latency of the system.

False (B)

When data is accessed, a single memory access fetches two consecutive words of the vector.

False (B)

In column-major data access, the matrix is traversed in a row order thus resulting in more efficient performance.

False (B)

Processing units in parallel computers can only operate under the centralized control of a single processor.

False (B)

In SIMD, each processor has its own control unit executing different instructions.

False (B)

RISC computers use complex instructions.

False (B)

In MIMD, multiple processors asynchronously issue instructions.

True (A)

In static networks, communication links are connected to one another dynamically.

False (B)

Flashcards

Parallel Computing

A type of computation where multiple calculations or processes are carried out jointly, dividing large problems into smaller ones to be solved simultaneously.

Bit-level Parallelism

A type of parallelism that focuses on manipulating data at the bit level to perform multiple operations simultaneously.

Instruction-level Parallelism

A type of parallelism that allows multiple instructions from a program to be executed simultaneously, improving the execution speed.

Data Parallelism

A parallel computing approach where the same operation is performed on multiple data elements simultaneously, exploiting the independence of those elements.


Task Parallelism

A form of parallel computing where independent tasks are executed simultaneously, leveraging different parts of a system to enhance efficiency.


Motivation for Parallelism

The role of parallelism in accelerating computing speeds has been recognized for several decades; it provides a multiplicity of data paths and increased access to storage elements.


Moore's Law

States that the number of transistors on integrated circuits doubles approximately every two years, leading to increased processing power and better performance.


Memory/Disk Speed Argument

Memory access times improve at a slower rate than processor speeds, causing a mismatch that leads to performance bottlenecks.


Latency

The time it takes from a memory request issue to when the data is available to the processor, an important parameter of memory system performance.


Bandwidth

A measure of memory system performance that indicates the rate at which data can be pumped to the processor.


Dynamic Random Access Memory (DRAM)

A memory type that works by storing bits of data in a capacitor inside an integrated circuit and is commonly used in RAM and GPUs.


Data Communication Argument

As the network evolves, databases and data mining are applications where the volume of data is such that it cannot be moved, so analyses must be performed over the network using parallel techniques.


Goal of applications of parallelism

Achieving better performance-to-cost ratios across various application domains.


Use of parallelism in High-Speed Circuit design

Using parallel systems to evaluate layout options to optimize for delays and capacitive and inductive effects


Scientific Applications of Parallel Computing

Weather modeling, mineral prospecting and flood prediction.


Parallelism and Large Scale Servers

Using parallelism to power mail and web servers that are implemented using parallel platforms.


Parallelism in Computer Systems

Conventional, structured peer-to-peer networks impose overlay networks and use algorithms drawn directly from parallel computing.


Scope of Parallelism

Different applications utilize different aspects of parallelism: data-intensive applications utilize high aggregate throughput, while server applications utilize high aggregate network bandwidth.


Microprocessor Gains

Microprocessor clock speeds have posted impressive gains over the past two decades


Processor Resource Use

Processors use resources in multiple functional units and execute multiple instructions in the same cycle.


Pipelining

Overlaps various stages of instruction execution to achieve performance.


Scheduling branch dependency

Scheduling instructions across conditional branch statements cannot be done deterministically a-priori


VLIW

Unlike conventional processors, VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.


Improving Memory Latency

Achieving effective use of memory by using caches.


Role of Caches

A cache acts as a low-latency high-bandwidth storage between RAM and the processor.


Memory bandwidth determination

Memory bandwidth is determined by the bandwidth of the memory bus connected to the memory units.


Multicore Processors

A single processor chip that contains two or more independent processing cores, which can execute instructions simultaneously.


SIMD

A parallel computing model where a single control unit dispatches the same instruction to various processors that work on different data.


MIMD

A parallel computing model where each processor has its own control unit and can execute different instructions on different data items.


Supercomputer

The world's fastest computer at a particular time.


Shared Memory

A MIMD computer in which a common memory is accessible by all processors.


Distributed Memory

A MIMD computer in which the memory is partitioned into processor-private regions.


SISD

Single instruction stream, single data stream: This is a conventional, sequential computer.


Multiple Instruction Stream-Multiple Data Stream (MIMD)

Each processor has a separate program & one instruction stream is generated from each program for each processor. Each instruction operates upon different data


SPMD

Single program, multiple data: a technique employed to achieve parallelism; it is a subcategory of MIMD.


Message Passing Architecture

Interaction between processes running on different nodes has to be accomplished using messages.


Dynamic Network

Communication links are connected to one another dynamically using switches.


Bisection Width

Minimum number of links that must be cut to partition the network into two equal or almost equal halves.


Primary Forms of Data Exchange

The two primary forms of data exchange between parallel tasks are accessing a shared data space and exchanging messages (message passing).


Startup Time

Startup time is the time required to handle a message at the sending and receiving nodes, and the time to prepare a message.


Study Notes

Introduction to Parallel Computing

  • Parallel computing involves multiple processes or calculations carried out simultaneously.
  • This approach is used to divide large problems into smaller, solvable parts, enabling faster solutions.
  • Different categories of parallel computing exist, including bit-level, instruction-level, data, and task parallelism.

Topic Overview

  • Motivating factors for using parallelism.
  • Examines the scope of where parallel computing can be applied.

Motivation for Parallelism

  • Parallelism has long been recognized for its ability to speed up computations.
  • Commercial applications benefit from the multiplicity of data paths and increased access to storage elements provided by parallelism.
  • Parallel platforms offer scalable performance and lower costs, making them suitable for a wide range of applications.
  • Initially, creating parallel hardware and software was time-consuming and demanding.
  • Despite improving uniprocessor speeds, hardware design trends suggest uniprocessor architectures may not sustain performance increases due to physical and computational limits.
  • Standardized parallel programming environments, libraries, and hardware have made parallel solutions faster to implement.

The Computational Power Argument

  • Moore's Law: The number of transistors on integrated circuits doubles about every two years, leading to increased processing power and better performance.
  • Serial processors rely heavily on implicit parallelism.

The Memory/Disk Speed Argument

  • Processor speeds have risen about 40% annually over the last decade, while DRAM access times have improved only about 10% annually, causing performance bottlenecks (a rough illustration of the resulting gap follows this list).
  • Parallel platforms offer increased bandwidth to the memory system.
  • DRAM is dynamic random access memory found in RAM and GPUs. It uses capacitors to store bits of data within an integrated circuit.
  • Parallel platforms provide larger aggregate caches.
  • Principles of parallel algorithm design apply to memory optimization.
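
As a rough illustration (not from the notes; it simply compounds the 40% and 10% annual rates above over ten years): the processor-to-DRAM speed gap grows by a factor of about (1.40 / 1.10)^10 ≈ 11, so a processor that was well matched to its memory a decade earlier ends up spending most of its time waiting for data.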

The Data Communication Argument

  • The Internet is evolving into a large computing platform.
  • Many applications, especially in databases and data mining, involve data volumes too large to move.
  • Data analysis must occur over the network using parallel techniques.

Scope of Parallel Computing Applications

  • Parallelism addresses a wide range of applications across diverse domains, driven by various motivations.
  • The primary goal for improved applications is achieving better performance-to-cost ratios.

Applications in Engineering and Design

  • Airfoil design optimizes lift, drag, and stability.
  • Internal combustion engine design optimizes charge distribution and burn efficiency.
  • High-speed circuit design optimizes layouts for delays and capacitive/inductive effects.
  • Structural design optimizes structural integrity, design parameters, and cost.
  • Design and simulation of micro- and nano-scale systems.
  • Process optimization and operations research are key application areas.

Scientific Applications

  • Functional and structural characterization of genes and proteins is a major scientific application area.
  • Advances in computational physics and chemistry explore new materials, understand chemical pathways, and create more efficient processes.
  • Astrophysics applications study galaxy evolution, thermonuclear processes, and analyze extensive telescope datasets.
  • Weather modeling, mineral prospecting, and flood prediction are important scientific applications.
  • Bioinformatics and astrophysics present challenges in analyzing extremely large datasets.

Commercial Applications

  • Parallel computers play a vital role in powering Wall Street.
  • Data mining and analysis are used to optimize business and marketing decisions.
  • Large-scale servers, such as mail and web servers, often use parallel platforms.
  • Information retrieval and search applications are powered by large clusters using parallelism.

Applications in Computer Systems

  • Core uses of parallel computing techniques include network intrusion detection, cryptography, and multiparty computations.
  • Embedded systems increasingly rely on distributed control algorithms.
  • Modern automobiles use many processors to optimize handling and performance.
  • Peer-to-peer networks use overlay networks and algorithms from parallel computing.

Scope of Parallelism

  • Conventional architectures have a processor, memory system, and datapath.
  • Performance bottlenecks exist in each component.
  • Parallelism addresses each component significantly.
  • Different applications use different aspects of parallelism: data-intensive applications use high aggregate throughput, server applications use high aggregate network bandwidth, and scientific applications use high processing and memory system performance.
  • Microprocessor clock speeds have significantly increased over the past two decades.
  • Different types of processors include microprocessors, microcontrollers, embedded processors, and digital signal processors.
  • CPUs are classified as single-core, dual-core, quad-core, hexa-core, octa-core, and deca-core processors.
  • Current processors execute multiple instructions in the same cycle, using multiple functional units due to higher device integration and more transistors.
  • Transistors amplify or switch electrical signals in a circuit.

Pipelining and Superscalar Execution

  • Pipelining improves performance by overlapping instruction execution stages.
  • Instructions can be executed, decoded, and fetched simultaneously.
  • This process resembles an assembly line in manufacturing.
  • Pipelining has limits: conventional processors rely on very deep pipelines, yet in typical program traces roughly every fifth or sixth instruction is a conditional jump.
  • A mispredicted jump forces the instructions already in the pipeline to be flushed, and this penalty grows with the pipeline's depth.
  • Multiple pipelines (superscalar execution) alleviate this bottleneck; a short worked illustration of ideal pipeline speedup follows this list.
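
A short worked illustration (not in the notes; it assumes an ideal k-stage pipeline with no stalls or mispredictions): executing n instructions takes k + (n - 1) cycles instead of n·k, for a speedup of n·k / (k + n - 1), which approaches k for large n. For example, 1,000 instructions on a 5-stage pipeline take about 1,004 cycles instead of 5,000, roughly a 5x speedup, and every flushed branch eats into that gain.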

Superscalar Execution

  • Instruction scheduling is determined by true data dependency, resource dependency, and branch dependency.
  • The scheduler looks at a window of instructions in an instruction queue and determines how many can be executed concurrently, based on these dependencies (see the sketch after this list).
  • The complexity of hardware is an important constraint on superscalar processors.
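
A minimal C sketch (illustrative only; the variable names are invented, not from the notes) of the dependencies that constrain what the scheduler can co-schedule:

    #include <stdio.h>

    /* Illustrative only: the kinds of dependencies a superscalar
       scheduler must respect when deciding what to issue together. */
    int main(void) {
        double a = 1.0, b = 2.0, c = 3.0, d = 4.0;

        double x = a + b;   /* independent of the next line:          */
        double y = c + d;   /* both could issue in the same cycle     */

        double z = x + y;   /* true data dependency: needs x and y    */

        if (z > 0.0)        /* branch dependency: whether the next    */
            z = z + a;      /* add runs cannot be known a priori      */

        printf("%f\n", z);  /* resource dependency would arise if the */
        return 0;           /* additions above outnumbered the adders */
    }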

Superscalar Execution: Issue Mechanisms

  • Instructions are issued in the order they are encountered; this is known as in-order issue.
  • If an instruction cannot be issued because of a data dependency, only one instruction is issued in that cycle.
  • Instructions can also be issued out of order; this is known as dynamic issue.
  • With dynamic issue, when the second instruction depends on the first, a later independent instruction (e.g., the third) can be co-scheduled with the first.
  • In-order issue is generally limited in its performance.

Superscalar Execution: Efficiency Considerations

  • Functional units cannot be kept busy at all times.
  • Vertical waste occurs when no functional units are utilized during a cycle.
  • Horizontal waste occurs when only some of the functional units are utilized during a cycle.
  • Performance of superscalar processors is limited by parallelism in instruction traces, or when the scheduler can't extract parallelism.

Very Long Instruction Word (VLIW) Processors

  • Hardware cost and complexity of the superscalar scheduler is a major consideration in processor design.
  • VLIW processors analyze code at compile time to bundle instructions for concurrent execution.
  • The core concept was used in the Multiflow Trace machine around 1984.
  • Variants of this concept can be found in Intel IA64 processors.
  • Typical VLIW processors allow 4-way to 8-way parallelism.

Limitations of Memory System Performance

  • Memory system performance, not processor speed, often limits application performance.
  • Memory system performance is gauged by latency and bandwidth.
  • Memory latency: the time it takes for data to become available after the processor issues a memory request.
  • Bandwidth: the rate at which data can be pumped to the processor by the memory system.

Memory System Performance: Bandwidth and Latency

  • It is important to differentiate between latency and bandwidth.
  • Firehose example: if it takes 2 seconds for water to arrive after the hydrant is turned on, the latency is 2 seconds.
  • If the hydrant then delivers water at 5 gallons per second, the bandwidth is 5 gallons per second.
  • Low latency matters when you want an immediate response.
  • High bandwidth matters when you need to fight big fires (move large volumes of data).

Memory Latency: An Example

  • Consider a 1 GHz processor (1 ns clock), connected to DRAM with 100 ns latency (no caches).
  • It has two multiply-add units, which can execute four instructions each cycle of 1 ns.
  • Peak processor rating is 4 GFLOPS
  • With memory latency at 100 ns the processor waits 100 cycles before processing data.

Memory Latency: An Example (continued)

  • Consider computing the dot product of two vectors: each pair of vector elements requires one multiply-add, and each floating-point operation requires one data fetch from memory.
  • The peak computation speed is therefore limited to one floating-point operation every 100 ns, i.e., about 10 MFLOPS (see the sketch after this list).
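
A minimal C sketch of the dot-product loop under discussion (the latency figures in the comments assume the 100 ns DRAM, no-cache machine from the example; the vector length is an arbitrary choice):

    #include <stdio.h>

    #define N 1024

    int main(void) {
        static double x[N], y[N];
        double sum = 0.0;

        for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

        /* One multiply-add per element pair.  On the example machine each
           operand fetch stalls for ~100 ns, so the loop runs at roughly one
           floating-point operation every 100 ns (~10 MFLOPS), far below the
           4 GFLOPS peak of the two multiply-add units. */
        for (int i = 0; i < N; i++)
            sum += x[i] * y[i];

        printf("dot = %f\n", sum);
        return 0;
    }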

Improving Effective Memory Latency Using Caches

  • Caches are small and fast elements located between the processor and RAM.
  • Caches provide low-latency, high-bandwidth storage.
  • Using caches reduces the effective memory latency, cutting data access time for frequently used data.
  • The cache hit ratio largely determines overall memory system performance (a rough estimate follows this list).
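
A rough estimate of the effect (a standard model, not spelled out in the notes): with hit ratio h, cache latency t_c, and DRAM latency t_m, the average access time is approximately h·t_c + (1 - h)·t_m. For h = 0.9, t_c = 1 ns, and t_m = 100 ns this gives about 10.9 ns, roughly a tenth of the uncached latency.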

Impact of Caches: Example

  • Consider the same 1 GHz processor (1 ns clock) connected to a 100 ns latency DRAM, with two multiply-add units capable of executing four instructions per cycle.
  • Now add a 32 KB cache with a latency of about 1 ns, and consider multiplying two 32 x 32 matrices.
  • Fetching the two input matrices into the cache corresponds to fetching 2K words, which takes approximately 200 µs.
  • Multiplying two n x n matrices takes 2n³ operations; for n = 32 this is 64K operations, which can be performed in 16K cycles (16 µs) at four instructions per cycle.
  • The total time is approximately the sum of the load/store time and the computation time, i.e., 200 + 16 µs.
  • This corresponds to a peak computation rate of 64K operations / 216 µs, or roughly 300 MFLOPS (see the sketch after this list).
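
A minimal C sketch of the 32 x 32 multiplication used in this example (illustrative; the timing comments simply restate the figures above):

    #include <stdio.h>

    #define N 32   /* the two N x N input matrices (2K words) fit in the 32 KB cache */

    int main(void) {
        static double a[N][N], b[N][N], c[N][N];

        /* Fetching a and b from DRAM: 2K words at ~100 ns each, about 200 us. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { a[i][j] = 1.0; b[i][j] = 2.0; }

        /* 2*N^3 = 64K operations; with the operands cached they run at four
           instructions per cycle, i.e. about 16 us on the example processor. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += a[i][k] * b[k][j];
                c[i][j] = s;
            }

        printf("c[0][0] = %f\n", c[0][0]);
        return 0;
    }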

Impact of Memory Bandwidth

  • Memory bandwidth is determined by the bandwidth of the memory bus that connects the memory units to the processor.
  • The memory bus is a set of wires or conductors that carries data and instructions between main memory and the CPU (or a memory controller).
  • Memory bandwidth improves when the size of the memory blocks is increased.
  • In this model, the memory system takes l time units (the latency) to deliver b units of data, where b is the block size.

Impact of Memory Bandwidth: Example

  • Suppose the block size is four words instead of one, the bus connecting the processor and memory is 64 bits wide and runs at 800 MHz, and the vectors are laid out linearly in memory.
  • For the dot product, eight FLOPs (four multiply-adds) can then be performed in 200 cycles.

Impact of Memory Bandwidth: Example (continued)

  • A single memory access now fetches four consecutive words of the vector.
  • With four words delivered per access, the dot product performs one floating-point operation every 25 ns.
  • 64 bits x 800 MHz = 51,200 Mbit/s = 6,400 MB/s (divide by 8 to convert bits to bytes).
  • This works out to 6,400/200, or 32 MFLOPS.

Impact of Memory Bandwidth (continued)

  • Buses transfer data between the computer's internal hardware components.
  • Bus capacity is measured in bits; the greater the bus capacity, the faster the data transfer and the better the system performance.
  • Increasing the block size does not change the latency of the system.
  • Higher bandwidth results in higher peak computation rates.
  • The layout of consecutive data words in memory is key.
  • Computations must be reordered to enhance spatial locality of reference.

Impact of Memory Bandwidth: Example

  • The vector column_sum is small and easily fits into the cache.
  • The matrix b is accessed in column order.
  • This strided access results in very poor performance.
  • Code that traverses the matrix row by row can be expected to perform much better (see the sketch after this list).
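
A minimal C sketch of the two traversal orders being compared (the 1000 x 1000 size is an assumption for illustration; the notes do not give one):

    #include <stdio.h>

    #define N 1000

    static double b[N][N];
    static double column_sum[N];

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                b[i][j] = 1.0;

        /* Column-order traversal: consecutive accesses b[0][i], b[1][i], ...
           are N doubles apart, so each touches a different cache line
           (strided access, poor spatial locality). */
        for (int i = 0; i < N; i++) {
            column_sum[i] = 0.0;
            for (int j = 0; j < N; j++)
                column_sum[i] += b[j][i];
        }

        /* Row-order traversal of the same computation: the inner loop walks
           b[j][0], b[j][1], ... contiguously, so whole cache lines are used. */
        for (int i = 0; i < N; i++)
            column_sum[i] = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                column_sum[i] += b[j][i];

        printf("column_sum[0] = %f\n", column_sum[0]);
        return 0;
    }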

Memory System Performance: Summary

  • Spatial and temporal locality are critical to amortizing memory latency and increasing bandwidth.
  • The ratio of operations to memory accesses is a good indicator of tolerance to memory bandwidth limitations.
  • Memory layout and the organization of the computation can significantly affect spatial and temporal locality.

Control Structure of Parallel Computing

  • Modern processors use hyper-threading, by which each physical core behaves, in effect, as multiple logical cores; a system tool such as Device Manager may therefore report 16 processors or cores.
  • Smartphones today use multicore processors of dual-core (two), quad-core (four) and octa-core (eight).
  • Processing units in parallel computers either operate under the centralized control of a single control unit or work independently.
  • Parallelism is expressed at various dispatch levels.
  • If a single control unit dispatches the same instruction to the various processors (each working on different data), the architecture is single instruction stream, multiple data stream (SIMD).
  • In multiple instruction stream, multiple data stream (MIMD) computers, each processor has its own control unit and can execute different instructions on different data items (see the sketch after this list).
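
A minimal C sketch contrasting the two control styles (illustrative only; real SIMD execution happens inside the hardware or through vector units, and the "processors" here are just comments):

    #include <stdio.h>

    #define N 8

    int main(void) {
        double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* SIMD style: the *same* operation is applied to every element.
           A SIMD machine's single control unit would dispatch this add to
           all processing elements in lockstep, each holding different data. */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        /* MIMD style: independent processors run their *own* instruction
           streams asynchronously; the two loops below stand in for two
           different programs running on two different processors. */
        double sum = 0.0, max = c[0];
        for (int i = 0; i < N; i++) sum += c[i];                 /* "processor 0" */
        for (int i = 1; i < N; i++) if (c[i] > max) max = c[i];  /* "processor 1" */

        printf("sum = %f, max = %f\n", sum, max);
        return 0;
    }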

Definitions of Computing

  • A supercomputer is the world's fastest computer.
  • SIMD is Single Instruction Stream, Multiple Data Stream. -One instruction at a time, each operating on an array of data
  • MIMD is Multiple Instruction Stream, Multiple Data Stream. -Multiple processors asynchronously issuing instructions
  • Shared Memory is a MIMD computer in which a common memory is accessible by all processors -UMA: Uniform memory access by all processors -NUMA: Non-uniform memory access, based on location/ assignment
  • Distributed Memory is a MIMD computer in which the memory is partitioned into processor-private regions.
  • RISC is Reduced Instruction Set Computer. -Uses fast and simple instructions to do complex things
  • Pipelining refers to processing in assembly-line style.

Computing Taxonomy

  • In 1966, Flynn proposed a taxonomy based on Instruction Streams and Data Streams:
    • SISD: Single Instruction Stream, Single Data Stream – conventional sequential computer
    • SIMD: Single Instruction Stream, Multiple Data Stream – control unit dispatches instructions to multiple processing units
    • MISD: Multiple Instruction Stream, Single Data Stream – no practical architectures have been built
    • MIMD: Multiple Instruction Stream, Multiple Data Stream – each processing unit has its own control unit and executes its own instruction stream on its own data

MIMD Architectures

  • Within the MIMD classification, different program structures and address space organizations exist.
  • Programming Structures: -Single Program Multiple Data (SPMD) -Multiple Program Multiple Data (MPMD)
  • Address Space Organization -Distributed Memory Architectures (Message-Passing Multicomputer) -Shared Memory Architectures: (Multiprocessor and Multi-Core)

SPMD

  • Single program, multiple data: a technique employed to achieve parallelism; it is a subcategory of MIMD.
  • Multiple instances of the same program run independently on different input data.
  • They run simultaneously and independently, producing results faster.
  • SPMD is the most common style of parallel programming (see the MPI sketch after this list).
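
A minimal SPMD sketch using MPI (MPI is an illustrative assumption; the notes do not prescribe a library). Every process runs the same program, and the rank it obtains is what differentiates its behavior:

    #include <stdio.h>
    #include <mpi.h>

    /* SPMD: every process runs this same program; MPI_Comm_rank tells each
       process which one it is, so the single program can act differently
       per process (e.g., rank 0 does the coordinator-specific work). */
    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        printf("hello from rank %d of %d\n", rank, size);

        if (rank == 0)
            printf("rank 0 doing the coordinator-specific work\n");

        MPI_Finalize();
        return 0;
    }

Launched with, e.g., mpirun -np 4, the same executable produces four cooperating processes.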

MPMD

  • Refers to the case, within the MIMD classification, where each processor has its own program to execute.

Message Passing Architectures

  • Interactions between processes running on different nodes are accomplished through message passing.
  • The exchange of messages is used to transfer data, distribute work, and synchronize (a send/receive sketch follows this list).
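
A minimal message-passing sketch, again using MPI as an illustrative assumption (run with at least two processes):

    #include <stdio.h>
    #include <mpi.h>

    /* Process 0 sends a value to process 1; the data transfer, the work
       hand-off, and the synchronization all happen through the message. */
    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int payload = 42;
            MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int payload;
            MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", payload);
        }

        MPI_Finalize();
        return 0;
    }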

Interconnection Networks

  • Static networks: point-to-point communication links between processors; these are direct networks.
  • Dynamic networks: built using switches and communication links; these are indirect networks.

Criteria Static Networks

  • Diameter – the maximum shortest path between any two processors; gives the worst-case latency.
  • Bisection width – the minimum number of links that must be cut to partition the network into two (nearly) equal halves; indicates potential communication bottlenecks.
  • Bisection bandwidth – bisection width times the bandwidth of a single link (see the sketch after this list).
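
A small C sketch tabulating these metrics for a few standard static topologies (the formulas are standard textbook values, not listed in the notes):

    #include <stdio.h>

    /* Standard values for common static networks with p processors:
       ring:                diameter p/2 (p even),   bisection width 2
       2-D mesh (no wrap):  diameter 2*(sqrt(p)-1),  bisection width sqrt(p)
       hypercube:           diameter log2(p),        bisection width p/2   */
    int main(void) {
        const int p = 64, side = 8 /* sqrt(p) */, dim = 6 /* log2(p) */;

        printf("p = %d\n", p);
        printf("ring:      diameter = %d, bisection width = %d\n", p / 2, 2);
        printf("2-D mesh:  diameter = %d, bisection width = %d\n", 2 * (side - 1), side);
        printf("hypercube: diameter = %d, bisection width = %d\n", dim, p / 2);
        return 0;
    }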

Network Criteria

  • Delay: Time taken for a packet to travel from the source to the destination.
  • Total delay = transmission delay + propagation delay + queueing delay + processing delay.
  • Latency within the network: network latency refers to the time delay between issuing a memory request and the moment the data becomes available at the processor; communication latency is transmission time plus overhead.
  • Bandwidth -Number of bits that can be transmitted per second
  • Connectivity -Number of paths between any two processors
  • Cost -Number of links in the network
  • Throughput – the actual transfer rate achieved in the network, accounting for data transfer delays.

Communication Model of Parallel Platforms

  • There are two primary forms of data exchange: accessing a shared data space and exchanging messages (message passing).
  • Platforms that have a shared data space are shared-address-space machines or multiprocessors.
  • Platforms that support messaging are message-passing platforms or multicomputers.
  • In shared address space architectures, communication model is indirectly specified since some (or all) of the memory is accessible to all the processors.

Shared-Address-Space Platforms

  • Part or all of the memory is accessible to all processors.
  • Processors interact by modifying data objects stored in this shared-address-space.
  • If the time taken by a processor to access any memory word (local or global) is identical, the platform is classified as uniform memory access (UMA); otherwise it is non-uniform memory access (NUMA). A minimal programming sketch follows this list.
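
A minimal shared-address-space sketch using OpenMP (an illustrative assumption; the notes do not prescribe a programming model). The threads interact simply by reading and writing the shared array:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000

    int main(void) {
        static double a[N];
        double sum = 0.0;

        /* All threads see the same shared array 'a'; interaction happens by
           modifying shared data, not by sending messages. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;

        /* The reduction clause coordinates the threads' updates to 'sum'. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
        return 0;
    }

Compiled with OpenMP enabled (e.g., cc -fopenmp), the loop iterations are divided among the available threads.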

NUMA and UMA Shared-Address-Space Platforms

  • Multiprocessors can be categorized into three shared-memory model which include Uniform Memory Access (UMA), Non-uniform Memory Access (NUMA), or Cache-only Memory Access (COMA)
  • A single memory controller is used in the uniform-access (UMA) case.
  • In the non-uniform (NUMA) case, different controllers serve different parts of memory, which makes access to local memory faster than access to remote memory.
