Parallel Computing: Motivations and Scope

Questions and Answers

Which of the following is NOT a recognized role of parallelism in computing?

  • Accelerating computing speeds.
  • Increasing access to storage elements.
  • Reducing the need for standardized programming environments. (correct)
  • Providing multiplicity of data paths.

What trend in hardware design casts doubt on the sustainability of performance increments in uniprocessor architectures?

  • Fundamental physical and computational limitations. (correct)
  • Emergence of new sequential programming languages.
  • Decreasing transistor density on integrated circuits.
  • Reduced emphasis on memory optimization in algorithm design.

According to Moore's Law, what is the trend in the number of transistors on integrated circuits?

  • The number of transistors increases exponentially over time. (correct)
  • The number of transistors decreases exponentially over time.
  • The number of transistors remains constant.
  • The number of transistors decreases linearly over time.

What is the primary constraint caused by the mismatch in the rate of improvement between DRAM access times and processor speeds?

  • Significant performance bottlenecks. (correct)

In the context of data communication, what necessitates performing analyses on data over the network using parallel techniques?

  • The volume of data is too large to be moved. (correct)

What factor defines the primary goal for application improvement in the scope of parallel computing applications?

  • Performance-to-cost considerations. (correct)

Which of the following is an application of parallel computing in engineering?

  • Optimizing lift and drag in airfoil design. (correct)

In the context of scientific applications, what is one of the significant challenges that bioinformatics and astrophysics address using parallel computing?

  • Analyzing extremely large datasets. (correct)

Which of the following is a common application of parallel computing in business?

  • Optimizing business and marketing decisions. (correct)

What role does parallel computing play in modern automobiles?

  • Performing complex tasks for optimizing handling and performance. (correct)

What aspect of conventional computer architecture is identified as a significant performance bottleneck?

  • The memory system. (correct)

What is the relationship between the type of application and the aspect of parallelism it utilizes?

  • Different applications utilize different aspects of parallelism. (correct)

What is the primary function of a transistor in a microprocessor?

  • To amplify the electric current in a circuit. (correct)

How does pipelining improve processor performance?

  • By overlapping various stages of instruction execution. (correct)

What is a key problem related to conditional jump instructions in the context of deep pipeline processors?

  • The penalty of a mispredicted branch grows with the depth of the pipeline, since a larger number of instructions will have to be flushed. (correct)

What is a primary consideration that determines the scheduling of instructions in superscalar execution?

  • True data dependency. (correct)

What happens in the case of "in-order issue" when a second instruction cannot be issued because it depends on a first instruction?

  • Only one instruction is issued in the cycle. (correct)

In superscalar execution, what is indicated by "vertical waste?"

  • No functional units are utilized during a cycle. (correct)

What analysis method do VLIW processors rely on to bundle instructions together for concurrent execution?

  • Compile-time analysis. (correct)

In the context of memory system performance, what does 'latency' refer to?

  • The time from the issue of a memory request to the time the data is available. (correct)

What is the measure of bandwidth?

  • The number of bits that can be transmitted per second. (correct)

If a fire hose delivers water at a rate of 5 gallons/second, what does this rate represent in terms of memory system performance?

  • The bandwidth of the water flow. (correct)

What is the benefit of using caches in a memory system?

  • To reduce the effective latency of the memory system. (correct)

What term defines the portion of data references that are successfully found in the cache?

  • Cache hit ratio. (correct)

How can memory bandwidth be improved?

  • Increasing the size of memory blocks. (correct)

What remains unaffected even if the block size of words is increased?

  • Latency. (correct)

What programming approach can be used to improve performance when implementing a data-layout centric view?

  • Reordering computations to enhance spatial locality of reference. (correct)

What is the significance of the 'Hyper Threading' feature in modern processors?

  • Each core can behave as multiple logical cores. (correct)

Under what control structure do processing units operate in parallel computers?

  • Either under the centralized control of a single processor or independently. (correct)

What is the defining characteristic of a Single Instruction, Multiple Data (SIMD) architecture?

  • A single control unit dispatches the same instruction to various processors. (correct)

What is 'Pipelining'?

  • Processing in assembly-line style. (correct)

What distinguishes Multiple Instruction Stream, Multiple Data Stream (MIMD) architecture from other parallel processing architectures?

  • Multiple processors asynchronously issuing instructions. (correct)

What characterizes a Shared Memory MIMD computer?

  • A common memory is accessible by all processors. (correct)

Which of Flynn's classifications describes a conventional sequential computer?

  • SISD (Single Instruction, Single Data). (correct)

In the context of MIMD architectures, what does SPMD stand for?

  • Single Program Multiple Data. (correct)

In message passing architectures, what is the purpose of exchanging messages between processes?

  • To transfer data, work, and synchronize actions. (correct)

What is the key difference between static and dynamic interconnection networks?

  • Static networks use point-to-point communication links, while dynamic networks connect communication links dynamically using switches. (correct)

What does 'Bisection Width' refer to?

  • The minimum number of links that must be cut to partition the network into two equal halves. (correct)

Which of the following is a factor that affects network delay?

  • Transmission delay. (correct)

In parallel computing, what is the primary way tasks exchange data on shared-address-space platforms?

  • By accessing a shared data space. (correct)

In shared-address-space architectures, how is the communication model typically specified?

  • Indirectly. (correct)

What defines a uniform memory access (UMA) platform?

  • The time taken by a processor to access any memory word in the system is identical. (correct)

How do processors interact in shared-address-space systems?

  • By modifying data objects stored in the shared address space. (correct)

According to the material: Which type of memory access (UMA or NUMA) has more bandwidth?

  • NUMA. (correct)

When referring to communication costs in parallel machines, what does 'startup time' represent?

  • The time required to handle a message at the sending and receiving nodes. (correct)

Flashcards

Parallel Computing

A type of computation where many calculations/processes are carried out jointly to solve large problems simultaneously.

Motivation for Parallelism

The role of parallelism in increasing computing speeds, multiplicity of data paths, and scalable performance.

Moore's Law

The exponential increase in transistors on integrated circuits over time.

Memory/Disk Speed Argument

Memory access times improve more slowly than processor speeds, causing performance bottlenecks; parallel platforms help.

Data Communication Argument

As networks evolve, the Internet emerges as one large computing platform; analyses must be performed on data over the network using parallel techniques.

Forms of Parallelism

Different forms of parallelism: bit-level, instruction-level, data and task parallelism.

Scope of Parallel Computing Applications

Diverse applications of parallelism across engineering, science, commercial and computer systems.

Pipelining

Overlapping instruction execution stages to improve performance.

Superscalar Execution

Executing multiple instructions simultaneously.

In-Order Issue

Instructions issued in the order encountered.

Out-of-Order Issue

Instructions may be issued out of program order, so that a later instruction with no outstanding dependencies can be issued ahead of an earlier, stalled one.

Vertical Waste

No functional units are utilized.

Horizontal Waste

Only some of the functional units are utilized during a cycle.

Very Long Instruction Word (VLIW)

Rely on compile-time analysis to bundle instructions for concurrent execution.

Memory Latency

Time from issue of memory request to data availability.

Memory Bandwidth

Rate at which data can be pumped to the processor by the memory system.

Caches

Fast memory between processor and RAM.

Cache Hit Ratio

Fraction of data references satisfied by the cache.

Memory Bus

The bus that connects memory to the processor and carries data and instructions between them.

Memory Access

A single memory access can fetch a block of consecutive words (for example, four consecutive words of a vector) rather than a single word.

Buses

Paths that transmit data.

Hyper Threading

Feature that allows each physical processor core to appear as multiple logical cores.

Processors Operate Under Centralized Control

Processing units operate either under the centralized control of a single control unit or independently.

SIMD

Single instruction stream, multiple data stream (SIMD).

MIMD

Multiple instruction stream, multiple data stream (MIMD).

Supercomputer

The world's fastest computer at the time.

Shared Memory

A MIMD computer where a common memory is accessible by all processors.

Distributed Memory

A MIMD computer in which the memory is partitioned into processor-private regions.

Reduced Instruction Set Computer (RISC)

Uses fast and simple instructions to do complex things.

Pipelining

Processing in assembly-line style.

Computing Taxonomy

A taxonomy based on instruction and data streams.

SISD

Single Instruction, Single Data Stream - this is a conventional sequential computer.

SIMD

Single Instruction Stream, Multiple Data Stream - a single control unit dispatches instructions to multiple processing units.

MISD

Multiple Instruction Stream, Single Data Stream – no architectures in this group.

MIMD

Multiple Instruction Stream, Multiple Data Stream – each processing unit is able to execute instructions from a different control unit.

MIMD Architectures

MIMD computers can be further classified by program structure (SPMD, MPMD) and by address-space organization (shared or distributed memory).

SPMD

Single Program Multiple Data, technique employed to achieve parallelism.

MPMD

Multiple Program Multiple Data; each processor has its own program to execute.

Impact of Memory Bandwidth

Determined by the bandwidth of the buses responsible for transferring data among memory, processor, and input/output devices.

Message Passing Architectures

Complete computers connected through an interconnection network.

Study Notes

Parallel Computing Introduction

  • It involves many calculations or processes carried out jointly, where large problems are divided and solved simultaneously.
  • Forms of parallel computing include bit-level, instruction-level, data, and task parallelism.

Topic Overview

  • Covers motivations for using parallelism.
  • Discusses the scope of applications that benefit from parallel computing.

Motivation for Parallelism

  • Parallelism accelerates computing speeds.
  • Provides multiple data paths, enhances access to storage, and is useful in commercial applications.
  • Lower costs and scalable performance
  • Standardized parallel programming environments and hardware reduce time to solution.

The Computational Power Argument

  • Moore's Law states the number of transistors on integrated circuits doubles about every two years, enhancing processing power.
  • Serial processors often use implicit parallelism.
  • By 1975, the number of components per integrated circuit for minimum cost was 65,000.

The Memory/Disk Speed Argument

  • Processor speeds increase faster (40% per year) than DRAM access times (10% per year), causing bottlenecks.
  • Parallel platforms offer increased bandwidth to the memory system.
  • DRAM is dynamic random access memory used in RAM and GPUs, storing data bits in a capacitor within an integrated circuit.
  • Aggregate cache capacity is larger on parallel platforms.
  • Parallel algorithm design that optimizes data movement improves effective transfer rates to memory and disk.

The Data Communication Argument

  • The Internet's evolution envisions it as a large computing platform.
  • Large data volumes in databases and data mining require parallel techniques for analysis over the network.

Scope of Parallel Computing Applications

  • Parallelism is applied across diverse application domains for performance and cost benefits.

Applications in Engineering and Design

  • The design of airfoils to optimize lift, drag and stability.
  • Optimizing charge distribution and burn in the design of internal combustion engines.
  • Optimizing layouts for delays and capacitive and inductive effects in high-speed circuits.
  • Optimizing structural integrity, design parameters, and cost for structures.
  • Design and simulation of micro- and nano-scale systems.
  • Process optimization and operations research.

Scientific Applications

  • Functional and structural characterization of genes and proteins.
  • Advances in computational physics and chemistry explore new materials and chemical pathways.
  • Areas in astrophysics explore galaxy evolution, thermonuclear processes and analyze telescope data.
  • Weather modeling, mineral prospecting and flood prediction.
  • Bioinformatics and astrophysics present challenges in analyzing large datasets.

Commercial Applications

  • Parallel computers power Wall Street.
  • Data mining and analysis optimize business and marketing.
  • Large-scale mail and web servers use parallel platforms.
  • Applications like information retrieval and search use large clusters for parallelism.

Applications in Computer Systems

  • Uses include network intrusion detection, cryptography and multiparty computations.
  • Embedded systems use distributed control algorithms.
  • Modern automobiles use several processors for handling and performance optimization.
  • Peer-to-peer networks utilize overlay networks and algorithms from parallel computing.

Scope of Parallelism

  • A conventional computer consists of a processor, a memory system, and a datapath.
  • Each of these components can introduce performance-related bottlenecks.
  • Data-intensive applications require high throughput.
  • Server applications require high network bandwidth.
  • Scientific applications need high memory bandwidth and processing performance.
  • Microprocessor clock speeds have rapidly increased over recent decades.
  • Different processors are used for specific devices (microprocessors, microcontrollers, embedded processors, digital signal processors).
  • CPUs are classified as single-core, dual-core, quad-core, hexa-core, octa-core, and deca-core.
  • Higher device integration allows for more transistors.
  • Current processors use resources in multiple functional units, executing many instructions per cycle.
  • Transistors amplify electric current in circuits.

Pipelining and Superscalar Execution

  • Pipelining improves overall performance.
  • A pipelined processor executes one instruction while decoding the next and fetching the one after it.
  • Modern processors push this further with very deep pipelines (approximately 20 stages in Pentium processors), which makes pipelining's limitations more pronounced.
  • Typical instruction traces contain a conditional jump roughly every 5-6 instructions; when a branch is mispredicted, a large number of instructions must be flushed, and this penalty grows with the depth of the pipeline.
  • Multiple pipelines can alleviate such bottlenecks.

Superscalar Execution

  • Several factors determine the scheduling of instructions (illustrated in the sketch after this list):
  • True data dependency: the result of one operation is an input to the next.
  • Resource dependency: two operations require the same resource.
  • Branch dependency: instructions following a conditional branch cannot be scheduled until the branch outcome is known.
  • The scheduler looks at a window of instructions from an instruction queue and selects which of them to execute concurrently, based on the factors above.
  • The complexity of this scheduling hardware is a major constraint on superscalar processors.
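
A small hypothetical C fragment (not taken from the lesson) illustrating the three kinds of dependencies listed above:

    /* Hypothetical fragment illustrating the scheduling constraints. */
    int schedule_demo(int x, int y, int z, int p, int q) {
        int a = x * y;   /* instruction 1 */
        int b = a + z;   /* true data dependency: consumes the result of a */
        int c = p * q;   /* resource dependency: competes with the first multiply
                            for the multiplier, though otherwise independent */
        if (b > 0)       /* branch dependency: what follows cannot be scheduled
                            with certainty until the branch outcome is known */
            c = c - 1;
        return c;
    }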

Superscalar Execution: Issue Mechanisms

  • Instructions are issued in order as they are encountered.
  • If the second instruction has a data dependency on the first, only one instruction is issued in that cycle; this is called in-order issue.
  • Instructions are issued out of order.
  • If the second instruction depends on the first, the first and third instructions can be issued together instead; this is called dynamic (out-of-order) issue.
  • In-order issue is generally limited in the parallelism it can extract.

Superscalar Execution: Efficiency Considerations

  • Not all functional units work at all times.
  • Vertical waste is when no functional units are in use.
  • Horizontal waste is when only some of the functional units are utilized during a cycle.
  • Performance is limited by lack of parallelism and inability of the scheduler to extract parallelism.

Very Long Instruction Word (VLIW) Processors

  • Hardware costs and the complexity of the superscalar schedule are a major concern in processor design.
  • VLIW processors identify and group instructions that can be executed concurrently at compile time.
  • This was used in the Multiflow Trace machine (circa 1984).
  • Intel IA64 processors use a variant of this concept.
  • Limited to 4-way to 8-way parallelism.

Limitations of Memory System Performance

  • Memory systems, and not processor speed, are often the bottleneck.
  • A memory system's performance is measured with bandwidth and latency.
  • Latency is the time from the issue of a memory request to the time the data is available at the processor.
  • Bandwidth is the rate at which data can be pumped to the processor by the memory system (see the note below).
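
A standard way to relate the two measures (an assumption of this note, not stated explicitly in the lesson): the time to move a block of b units of data is approximately

    T(b) \approx l + \frac{b}{B}

where l is the latency and B is the bandwidth; latency dominates small transfers, while bandwidth dominates large ones.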

Memory System Performance: Bandwidth and Latency

  • Bandwidth refers to the rate of data transfer (gallons/second).
  • Latency refers to the delay before the data transfer begins (seconds).
  • Reducing latency leads to a better immediate response.
  • High bandwidth is needed when fighting big fires.

Memory Latency: An Example

  • Consider a 1 GHz processor (1 ns clock) connected to DRAM with 100 ns latency, and assume there are no data caches.
  • With its multiply-add units, the processor can execute four instructions every 1 ns cycle.
  • The peak processor rating is therefore 4 GFLOPS, a figure often quoted for scientific computations.
  • Since memory latency is 100 ns, the processor must wait 100 cycles for each memory request before it can process the data.

Memory Latency: An Example (Continued)

  • Consider computing the dot product of two vectors on the architecture above (see the sketch below):
  • Each floating point operation performs one multiply-add on a single pair of vector elements and requires one data fetch.
  • Because every fetch pays the 100 ns memory latency, the computation is limited to one floating point operation every 100 ns, a very small fraction of the processor's peak rating!
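
A minimal C sketch of the dot-product loop in this example (the function name and signature are illustrative assumptions). Each iteration performs one multiply-add and must fetch its operands from memory, so on the machine above the loop is bounded by the 100 ns DRAM latency rather than by the 4 GFLOPS peak:

    /* Dot product: one multiply-add per pair of elements.  With 100 ns DRAM
       latency and no cache, every element fetched stalls the processor, so the
       loop runs at roughly one FLOP per 100 ns (about 10 MFLOPS). */
    double dot(const double *a, const double *b, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];   /* one multiply-add, operands fetched from memory */
        return sum;
    }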

Improving Effective Memory Latency Using Caches

  • Caches are used as a fast memory element between the processor and the RAM.
  • Caches act as low-latency, high-bandwidth storage.
  • Caches reduce the effective latency of the memory system when data is reused.
  • The cache hit ratio is the fraction of data references satisfied by the cache.
  • The cache hit ratio achieved on a computation determines its performance (a standard formula follows).
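
A standard model (not given explicitly in these notes) for how the hit ratio sets performance: with hit ratio h, cache access time t_c, and DRAM access time t_m, the effective memory access time is

    t_{\text{eff}} = h \, t_c + (1 - h) \, t_m

so the closer h is to 1, the closer the effective latency is to that of the cache.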

Impact of Caches: Example

  • Take the same processor as before: 2 multiply-add units performing 4 instructions per 1 ns cycle at 1 GHz, with a DRAM latency of 100 ns and initially no cache.
  • A 32 KB cache with a 1 ns access latency is added; fetching 32 KB of data from DRAM takes approximately 200 microseconds.
  • Two matrices are multiplied, and the cache is large enough to hold the input matrices and the result.

Impact of Caches: Example Part 2

  • Fetching the two input matrices into the cache (about 32 KB of data) takes approximately 200 µs.
  • Multiplying two 32 x 32 matrices takes 64K operations, which can be performed in 16K cycles at 4 operations per cycle.
  • The total time is the time for load/store operations plus the time for computation, approximately 200 + 16 µs.
  • The peak computation rate is therefore roughly 64K FLOPs / 216 µs, or about 303 MFLOPS (checked below).
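
Putting the numbers above together as a quick check:

    \text{rate} \approx \frac{64\text{K FLOPs}}{(200 + 16)\,\mu\text{s}} = \frac{65{,}536}{216\,\mu\text{s}} \approx 303\ \text{MFLOPS}

which is roughly thirty times the one FLOP per 100 ns (about 10 MFLOPS) obtained without the cache.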

Impact of Memory Bandwidth

  • Memory bandwidth is determined by the bandwidth of the memory bus as well as that of the memory units.
  • The memory bus is the set of wires/conductors that connects electrical components and carries transfers from RAM to the memory controller or the CPU.
  • Memory bandwidth can be improved by increasing the size of the memory blocks fetched per access.
  • Memory bandwidth = b / t, where b is the size of the data block delivered and t is the time (in seconds) taken to deliver it.

Impact of Memory Bandwidth: Example

  • The block size is 4 words instead of 1 word.
  • The data path that carries data to the processor is 64 bits wide.
  • The bus connecting the memory and the processor runs at 800 MHz.
  • For the dot product computation, 8 FLOPs (4 multiply-adds) are performed every 200 cycles.

Impact of Memory Bandwidth: Example (Continued)

  • A single memory access now fetches four consecutive words of a vector.
  • Two such accesses fetch four elements of each of the two vectors, so the computation proceeds at one FLOP every 25 ns (200 cycles / 8 FLOPs).
  • The raw data bandwidth is 64 bits x 800 MHz, or 51,200 Mbit/s; dividing by 8 to convert bits to bytes gives 6,400 MByte/s.
  • Solving for speed as 6,400 / 200 gives 32 MFLOPS.

Impact of Memory Bandwidth (Continued)

  • Data paths are called buses; they are responsible for transferring data among the computer's hardware: memory, processors, and input/output devices.
  • Bus width is measured in bits, and system performance improves as bus capacity increases.
  • Increasing the block size (the number of words fetched per access) does not change the system's latency.
  • The wider fetch can be viewed as a wide data bus (4 words, or 128 bits) attached to multiple memory banks.
  • In more practical systems, consecutive words are sent over the bus on subsequent bus cycles after the first word is retrieved.

Impact of Memory Bandwidth (Continued)

  • The examples above indicate that increased bandwidth results in higher peak computation rates.
  • It was assumed that consecutive data words in memory are used by successive instructions (spatial locality of reference).
  • If a data-layout centric view is taken, computations must be reordered to enhance spatial locality of reference.

Impact of Memory Bandwidth: Specific Example

  • A code fragment sums the columns of matrix b into a column_sum vector (a sketch of such a fragment is shown below).
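
The fragment itself is not reproduced in these notes; a minimal C sketch of the loop being described (the size n and the declarations are assumptions):

    /* Sums the columns of b into column_sum.  The inner loop walks down a
       column of b, so consecutive accesses are n words apart (strided), which
       gives poor spatial locality for row-major C storage. */
    void column_sums(int n, double b[n][n], double column_sum[n]) {
        for (int i = 0; i < n; i++)
            column_sum[i] = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                column_sum[i] += b[j][i];
    }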

Specific Example of Impact of Memory Bandwidth (Continued)

  • The column_sum vector is small and easily fits in cache.
  • Matrix b is accessed in a column order.
  • Poor performance may result from strided access.

Impact of Memory Bandwidth: Solution

  • If the matrix is instead traversed in row order, performance improves substantially (see the sketch below).
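
Under the same assumptions, a sketch of the fix: interchanging the two loops makes the inner loop walk along a row of b, so consecutive accesses touch adjacent memory locations.

    /* Same computation with the loops interchanged: the inner loop now runs
       over i, so successive accesses b[j][i] touch consecutive memory
       locations (good spatial locality). */
    void column_sums_rowwise(int n, double b[n][n], double column_sum[n]) {
        for (int i = 0; i < n; i++)
            column_sum[i] = 0.0;
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                column_sum[i] += b[j][i];
    }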

Memory System Performance: Summary

  • Exploiting spatial and temporal locality in applications is key to amortizing memory latency and increasing effective memory bandwidth.
  • The ratio of the number of operations to the number of memory accesses is a good indicator of how well a computation tolerates limited memory bandwidth.
  • Memory layouts and the organization of the computation significantly affect the spatial and temporal locality achieved.

Control Structure of Parallel Computing

  • Processors use "Hyper Threading" so each core acts like multiple cores, this allows for device manager to recognize 16 processors.
  • Smart phones span to have octa-core, quad-core and dual-core variations.
  • There are processors which work independantly from a centralized processor in the parallel system.

Control Structure of Parallel Computing (Continued)

  • Parallelism can be expressed at various levels of granularity, from instructions up to processes.
  • A range of architectural and programming models supports these levels.
  • In SIMD, a single control unit dispatches the same instruction to multiple processing units, each operating on a different data stream.
  • In MIMD, each processing unit has its own control unit and can execute different instructions on different data items (see the sketch below).
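
A small illustration (a generic loop, not taken from the lesson): the element-wise addition below applies the same operation to every data element, the pattern a SIMD unit executes as a single instruction over multiple data items; on a MIMD machine, independent processors would instead each run their own instruction stream over their own part of the data.

    /* The same operation (an add) is applied to every element: a natural fit
       for SIMD execution, where one instruction operates on several data
       elements at once. */
    void vector_add(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }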

Definitions

  • Supercomputer: The fastest computer in the world at the time.
  • SIMD: Single instruction stream, multiple data stream
  • MIMD: Multiple instruction stream, multiple data stream
    • Multiple processors issue instructions asynchronously.
  • Shared memory: a common memory for all processors in a MIMD computer.
    • UMA: Uniform memory access by all processors.
    • NUMA: Non-uniform memory access, depending on the location of the memory.
  • Distributed memory: Memory partitioned into processor-private regions in a MIMD computer.
  • Reduced instruction set computer: RISC.
    • Complex functions are performed by simple and faster instructions.
  • Processing in assembly-line style is called pipelining.

Computing Taxonomy

  • In 1966, Flynn proposed a taxonomy based on instruction streams and data streams:
    • SISD: Single Instruction stream / Single Data stream; represents a conventional sequential computer.
    • SIMD: Single Instruction stream / Multiple Data stream; one control unit dispatches instructions to multiple processing units.
    • MISD: Multiple Instruction stream / Single Data stream; some would include pipelined architectures here, but essentially no architectures fall in this group.
    • MIMD: Multiple Instruction stream / Multiple Data stream; each processing unit executes instructions from its own control unit.

MIMD Architectures

  • The MIMD classification includes different program structures and different address-space organizations:
    • Program structure: SPMD (Single Program Multiple Data).
    • Program structure: MPMD (Multiple Program Multiple Data).
    • Address-space organization:
      • Distributed-memory architectures.
      • Shared-memory architectures, which include multiprocessors and multicore processors.

SPMD

  • Tasks are split across multiple processors and run simultaneously and independently on different data; SPMD is a subcategory of MIMD.
  • It represents a principal paradigm of parallel programming.
  • In SPMD, autonomous processing units simultaneously execute the same program on their own portions of the data (see the sketch below).
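
A minimal SPMD sketch (MPI is assumed here as the vehicle; the lesson does not name a particular library): every process runs the same program, and the process rank determines which part of the work it performs.

    #include <mpi.h>
    #include <stdio.h>

    /* SPMD: every process executes this same program; behaviour differs only
       through the process rank. */
    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("process %d of %d works on its own slice of the data\n",
               rank, size);
        MPI_Finalize();
        return 0;
    }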

Message Passing Architectures

  • The name refers to the messages that processes exchange: messages transfer data and work and synchronize the actions of processes running on complete computers connected by an interconnection network (a sketch follows below).
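
Continuing the MPI-based sketch above (again an illustrative assumption, not something the lesson prescribes): a message both transfers data and synchronizes the two processes, since the receive blocks until the data has arrived.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        double x = 0.0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            x = 3.14;
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* transfer data/work */
        } else if (rank == 1) {
            /* blocks until the message arrives: transfer plus synchronization */
            MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received %f\n", x);
        }
        MPI_Finalize();
        return 0;
    }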
