Questions and Answers
Which of the following is NOT a recognized role of parallelism in computing?
- Accelerating computing speeds.
- Increasing access to storage elements.
- Reducing the need for standardized programming environments. (correct)
- Providing multiplicity of data paths.
What trend in hardware design casts doubt on the sustainability of performance increments in uniprocessor architectures?
- Fundamental physical and computational limitations. (correct)
- Emergence of new sequential programming languages.
- Decreasing transistor density on integrated circuits.
- Reduced emphasis on memory optimization in algorithm design.
According to Moore's Law, what is the trend in the number of transistors on integrated circuits?
- The number of transistors increases exponentially over time. (correct)
- The number of transistors decreases exponentially over time.
- The number of transistors remains constant.
- The number of transistors decreases linearly over time.
What is the primary constraint caused by the mismatch in the rate of improvement between DRAM access times and processor speeds?
In the context of data communication, what necessitates performing analyses on data over the network using parallel techniques?
What factor defines the primary goal for application improvement in the scope of parallel computing applications?
Which of the following is an application of parallel computing in engineering?
In the context of scientific applications, what is one of the significant challenges that bioinformatics and astrophysics address using parallel computing?
Which of the following is a common application of parallel computing in business?
What role does parallel computing play in modern automobiles?
What aspect of conventional computer architecture is identified as a significant performance bottleneck?
What is the relationship between the type of application and the aspect of parallelism it utilizes?
What is the primary function of a transistor in a microprocessor?
How does pipelining improve processor performance?
What is a key problem related to conditional jump instructions in the context of deep pipeline processors?
What is a primary consideration that determines the scheduling of instructions in superscalar execution?
What happens in the case of "in-order issue" when a second instruction cannot be issued because it depends on a first instruction?
In superscalar execution, what is indicated by "vertical waste?"
What analysis method do VLIW processors rely on to bundle instructions together for concurrent execution?
In the context of memory system performance, what does 'latency' refer to?
What is the measure of bandwidth?
If a fire hose delivers water at a rate of 5 gallons/second, what does this rate represent in terms of memory system performance?
What is the benefit of using caches in a memory system?
What term defines the portion of data references that are successfully found in the cache?
What can memory bandwidth be improved by?
What remains unaffected even if the block size of words is increased?
What programming approach can be used to improve performance when implementing a data-layout centric view?
What is the significance of the 'Hyper Threading' feature in modern processors?
Under what control structure do processing units operate in parallel computers?
What is the defining characteristic of a Single Instruction, Multiple Data (SIMD) architecture?
What is 'Pipelining'?
What distinguishes Multiple Instruction Stream, Multiple Data Stream (MIMD) architecture from other parallel processing architectures?
What characterizes a Shared Memory MIMD computer?
Which of Flynn's classifications describes a conventional sequential computer?
In the context of MIMD architectures, what does SPMD stand for?
In message passing architectures, what is the purpose of exchanging messages between processes?
What is the key difference between static and dynamic interconnection networks?
What does 'Bisection Width' refer to?
Which of the following is a factor that affects network delay?
In parallel computing, what is the primary way tasks exchange data on shared-address-space platforms?
In shared-address-space architectures, how is the communication model typically specified?
What defines a uniform memory access (UMA) platform?
How do processors interact in shared-address-space systems?
According to the material: Which type of memory access (UMA or NUMA) has more bandwidth?
When referring to communication costs in parallel machines, what does 'startup time' represent?
Flashcards
Parallel Computing
A type of computation where many calculations/processes are carried out jointly to solve large problems simultaneously.
Motivation for Parallelism
The role of parallelism in increasing computing speeds, multiplicity of data paths, and scalable performance.
Moore's Law
The exponential increase in transistors on integrated circuits over time.
Memory/Disk Speed Argument
Data Communication Argument
Forms of Parallelism
Scope of Parallel Computing Applications
Pipelining
Superscalar Execution
In-Order Issue
Out-of-Order Issue
Vertical Waste
Horizontal Waste
Very Long Instruction Word (VLIW)
Memory Latency
Memory Bandwidth
Caches
Cache Hit Ratio
Memory Bus
Memory Access
Buses
Hyper Threading
Processors Operate Under Centralized Control
SIMD
MIMD
Supercomputer
Shared Memory
Distributed Memory
Reduced Instruction Set Computer (RISC)
Pipelining
Computing Taxonomy
SISD
SIMD
MISD
MIMD
MIMD Architectures
SPMD
MPMD
Impact of Memory Bandwidth
Message Passing Architectures
Study Notes
Parallel Computing Introduction
- It involves many calculations or processes carried out jointly, where large problems are divided and solved simultaneously.
- Forms of parallel computing include bit-level, instruction-level, data, and task parallelism.
Topic Overview
- Covers motivations for using parallelism.
- Discusses the scope of applications that benefit from parallel computing.
Motivation for Parallelism
- Parallelism accelerates computing speeds.
- It provides a multiplicity of data paths, increases access to storage elements, and is useful in commercial applications.
- It lowers costs and offers scalable performance.
- Standardized parallel programming environments and hardware reduce time to solution.
The Computational Power Argument
- Moore's Law states the number of transistors on integrated circuits doubles about every two years, enhancing processing power.
- Serial processors often use implicit parallelism.
- In 1965, Moore predicted that by 1975 the number of components per integrated circuit for minimum cost would be 65,000.
The Memory/Disk Speed Argument
- Processor speeds increase faster (40% per year) than DRAM access times (10% per year), causing bottlenecks.
- Parallel platforms offer increased bandwidth to the memory system.
- DRAM (dynamic random-access memory) is used for main memory and graphics hardware, storing each data bit in a capacitor within an integrated circuit.
- Parallel platforms provide larger aggregate caches.
- Principles of parallel algorithm design also apply to memory optimization, improving data transfer to memory and disk.
The Data Communication Argument
- The Internet's evolution envisions it as a large computing platform.
- Large data volumes in databases and data mining require parallel techniques for analysis over the network.
Scope of Parallel Computing Applications
- Parallelism is applied across diverse application domains for performance and cost benefits.
Applications in Engineering and Design
- The design of airfoils to optimize lift, drag and stability.
- Optimizing charge distribution and burn in the design of internal combustion engines.
- Optimizing layouts for delays and capacitive and inductive effects in high-speed circuits.
- Optimizing structural integrity, design parameters, and cost for structures.
- Design and simulation of micro- and nano-scale systems.
- Process optimization and operations research.
Scientific Applications
- Functional and structural characterization of genes and proteins.
- Advances in computational physics and chemistry explore new materials and chemical pathways.
- Astrophysics uses parallel computing to explore galaxy evolution and thermonuclear processes, and to analyze telescope data.
- Weather modeling, mineral prospecting and flood prediction.
- Bioinformatics and astrophysics present challenges in analyzing large datasets.
Commercial Applications
- Parallel computers power Wall Street.
- Data mining and analysis optimize business and marketing.
- Large-scale mail and web servers use parallel platforms.
- Applications like information retrieval and search use large clusters for parallelism.
Applications in Computer Systems
- Uses include network intrusion detection, cryptography and multiparty computations.
- Embedded systems use distributed control algorithms.
- Modern automobiles use several processors for handling and performance optimization.
- Peer-to-peer networks utilize overlay networks and algorithms from parallel computing.
Scope of Parallelism
- A conventional computer includes a datapath, a memory system, and a processor.
- These components introduce performance-related bottlenecks.
- Data-intensive applications require high aggregate throughput.
- Server applications require high aggregate network bandwidth.
- Scientific applications require high processing and memory-system performance.
Implicit Parallelism: Trends in Microprocessor Architectures
- Microprocessor clock speeds have rapidly increased over recent decades.
- Different processors are used for specific devices (microprocessors, microcontrollers, embedded processors, digital signal processors).
- CPUs are classified as single-core, dual-core, quad-core, hexa-core, octa-core, and deca-core.
- Higher device integration allows for more transistors.
- Current processors use these resources in multiple functional units, executing multiple instructions per cycle.
- Transistors switch and amplify electrical current in circuits.
Pipelining and Superscalar Execution
- Pipelining improves overall performance by overlapping the execution of instructions.
- A pipelined processor executes one instruction while decoding the next and fetching the one after that.
- Pipelining has limits; modern processors push single-pipeline speed with very deep pipelines (approximately 20 stages in Pentium processors).
- Conditional jumps occur roughly every 5-6 instructions; on a misprediction a large number of instructions must be flushed, with a penalty that grows with the depth of the pipeline (a sketch follows this section).
- Multiple pipelines can alleviate such bottlenecks.
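The branch penalty can be pictured with a hypothetical C fragment (not from the notes) whose loop body contains a conditional jump:

```c
/* Hypothetical sketch: a loop whose body contains a conditional jump.
   On a deep pipeline, a mispredicted branch forces the speculatively
   fetched instructions behind it to be flushed, and the penalty grows
   with the number of pipeline stages. */
long count_positive(const int *a, long n) {
    long count = 0;
    for (long i = 0; i < n; i++) {
        if (a[i] > 0)   /* conditional jump, roughly every few instructions */
            count++;
    }
    return count;
}
```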
Superscalar Execution
- Factors that determine the scheduling of instructions:
- True Data Dependency: the result of one operation is an input to the next.
- Resource Dependency: two operations require the same resource.
- Branch Dependency: instructions across a conditional branch cannot be scheduled deterministically, since the branch outcome is not known in advance.
- The scheduler examines a number of instructions from an instruction queue and selects those that can be executed concurrently based on the factors above (a small sketch follows this section).
- The complexity of the hardware constrains superscalar processors.
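The dependency kinds above can be illustrated with a small hypothetical C fragment (the variable names and 2-way issue width are assumptions, not taken from the notes):

```c
/* Hypothetical sketch of the scheduling constraints discussed above. */
void dependency_demo(int a, int b, int c, int d) {
    /* True data dependency: the second statement consumes the result of the
       first, so the two additions cannot be issued in the same cycle. */
    int x = a + b;
    int y = x + c;

    /* Independent operations: no dependency, so a 2-way superscalar
       processor could issue both additions in the same cycle. */
    int p = a + d;
    int q = b + c;

    (void)y; (void)p; (void)q;   /* suppress unused-variable warnings */
}
```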
Superscalar Execution: Issue Mechanisms
- With in-order issue, instructions are issued in the order in which they are encountered.
- If the second instruction has a data dependency on the first, only one instruction is issued per cycle.
- With out-of-order issue, instructions may be issued out of program order.
- If the second instruction depends on the first, the first and third instructions can be issued together; this is called dynamic issue (see the sketch below).
- In-order issue is generally more limited in the parallelism it can extract.
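A hypothetical three-operation sequence (not from the notes) makes the difference concrete:

```c
/* In-order issue:  op1 issues alone because op2 depends on it, and op3 must
                    wait its turn behind op2.
   Dynamic issue:   op1 and op3 can be issued together; op2 follows once op1
                    has produced its result. */
void issue_demo(int r0, int r4, int r5) {
    int r1 = r0 + 4;     /* op1 */
    int r2 = r1 * 2;     /* op2: true data dependency on op1 */
    int r3 = r4 - r5;    /* op3: independent of op1 and op2 */
    (void)r2; (void)r3;
}
```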
Superscalar Execution: Efficiency Considerations
- Not all functional units work at all times.
- Vertical waste occurs when none of the functional units are used during a cycle.
- Horizontal waste occurs when only some of the functional units are used during a cycle.
- Performance is limited by the parallelism available in the instruction stream and by the scheduler's ability to extract it.
Very Long Instruction Word (VLIW) Processors
- Hardware cost and the complexity of the superscalar scheduler are major concerns in processor design.
- VLIW processors rely on compile-time analysis to identify and bundle instructions that can be executed concurrently.
- This was used in the Multiflow Trace machine (circa 1984).
- Intel IA64 processors may use this concept.
- Limited to 4-way to 8-way parallelism.
Limitations of Memory System Performance
- Memory systems, and not processor speed, are often the bottleneck.
- A memory system's performance is measured by its bandwidth and latency.
- Latency is the time from when a memory request is made until the data is available at the processor.
- Bandwidth is the rate at which data is pumped to the processor.
Memory System Performance: Bandwidth and Latency
- Bandwidth refers to the rate of data transfer (gallons/second).
- Latency refers to the delay before the data transfer begins (seconds).
- Reducing latency leads to a better immediate response
- High bandwidth is needed when fighting big fires.
Memory Latency: An Example
- A 1 GHz processor (1 ns clock) is connected to DRAM with 100 ns latency; assume there are no data caches.
- With two multiply-add units, the processor can execute four instructions in each 1 ns cycle.
- The peak processor rating is therefore 4 GFLOPS, a measure of performance often used for scientific computations.
- Since memory latency is 100 ns, the processor must wait 100 cycles for each memory request before it can use the data.
Memory Latency: An Example (Continued)
- Consider computing the dot product of two vectors on the architecture above:
- Each multiply-add operates on a single pair of vector elements, so each floating point operation requires one data fetch.
- The computation is therefore limited to one floating point operation every 100 ns, a very small fraction of the processor's peak rating (see the sketch below).
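As a concrete sketch (hypothetical code; only the latency and peak figures come from the notes), the dot-product loop fetches fresh vector elements on every iteration, so without caches the computation stalls on DRAM latency rather than on arithmetic:

```c
/* Each iteration performs one multiply-add on a pair of vector elements that
   must be fetched from memory; with 100 ns DRAM latency and no caches, the
   loop is limited to roughly one floating point operation every 100 ns,
   far below the processor's 4 GFLOPS peak rating. */
double dot_product(const double *a, const double *b, long n) {
    double sum = 0.0;
    for (long i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```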
Improving Effective Memory Latency Using Caches
- Caches are used as a fast memory element between the processor and the RAM.
- Caches act as low-latency, high-bandwidth storage.
- Caches reduce the effective latency of the memory system if data is reused.
- The cache hit ratio is the fraction of data references satisfied by the cache.
- The hit ratio achieved by a computation largely determines its performance (a sketch of a standard approximation follows this section).
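A minimal sketch of the standard average-access-time approximation implied by the hit-ratio discussion (the formula and example latencies are assumptions, not quoted from the notes):

```c
/* Assumed standard approximation:
   average access time = hit_ratio * cache_latency + (1 - hit_ratio) * dram_latency */
double avg_access_time_ns(double hit_ratio,        /* fraction of references served by the cache */
                          double cache_latency_ns, /* e.g. 1 ns, as in the example that follows */
                          double dram_latency_ns)  /* e.g. 100 ns */
{
    return hit_ratio * cache_latency_ns + (1.0 - hit_ratio) * dram_latency_ns;
}
```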
Impact of Caches: Example
- As before, a 1 GHz processor with 2 multiply-add units executes 4 instructions per 1 ns cycle, and DRAM latency is 100 ns.
- A 32 KB cache with a 1 ns latency is added to the processor.
- Two 32 x 32 matrices are multiplied; the cache is large enough to hold both input matrices and the result matrix.
Impact of Caches: Example Part 2
- Fetching the two input matrices into the cache takes approximately 200 µs.
- Multiplying two 32 x 32 matrices amounts to 64K operations for this problem, which can be performed in 16K cycles (16 µs).
- The total time is the time for load/store operations plus the time for computation, approximately 200 + 16 = 216 µs.
- The peak computation rate is roughly 64K operations / 216 µs, or about 296.29 MFLOPS (the arithmetic is reproduced in the sketch below).
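A short sketch reproducing the arithmetic above with the numbers given in the notes (64K is taken as 64,000, which yields the quoted rate):

```c
#include <stdio.h>

/* Reproduces the arithmetic of the cache example above. */
int main(void) {
    double fetch_time_us   = 200.0;    /* fetching both matrices into the cache */
    double compute_time_us = 16.0;     /* 16K cycles at 1 ns per cycle */
    double operations      = 64000.0;  /* "64K" operations, as in the notes */

    double total_us = fetch_time_us + compute_time_us;            /* 216 us */
    printf("peak rate ~= %.2f MFLOPS\n", operations / total_us);  /* ~296.3 MFLOPS */
    return 0;
}
```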
Impact of Memory Bandwidth
- Memory bandwidth is determined by the bandwidth of the memory bus and of the memory units it connects.
- The memory bus wires/conductors connect electrical components and allow transfers from RAM to the memory controller or the CPU.
- Memory bandwidth can be improved by increasing the size of the memory blocks fetched per access.
- Bandwidth can then be expressed as b units of data divided by the time (in seconds) needed to deliver them, where b is the data block size.
Impact of Memory Bandwidth: Example
- The block size is 4 words instead of 1 word.
- The data path carrying data to the processor is 64 bits wide.
- The bus connecting the memory and the processor runs at 800 MHz.
- For the dot-product computation, 8 FLOPs (4 multiply-adds) can be performed every 200 cycles.
Impact of Memory Bandwidth: Example (Continued)
- A single memory access now fetches four consecutive words of a vector.
- Two accesses therefore fetch four element pairs, which corresponds to one FLOP every 25 ns (200 cycles / 8 FLOPs).
- Data bandwidth = 64 bits x 800 MHz = 51,200 Mbit/s, or 6,400 MByte/s after dividing by 8 to convert bits to bytes.
- Solving for speed: 6,400 / 200 = 32 MFLOPS (the bus-bandwidth arithmetic is reproduced in the sketch below).
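A sketch reproducing the bus-bandwidth arithmetic from this example (bus width and frequency as given in the notes):

```c
#include <stdio.h>

/* Reproduces the data-bandwidth arithmetic of the example above. */
int main(void) {
    double bus_width_bits = 64.0;   /* width of the data path to the processor */
    double bus_freq_mhz   = 800.0;  /* frequency of the bus between memory and processor */

    double mbit_per_s  = bus_width_bits * bus_freq_mhz;  /* 51,200 Mbit/s */
    double mbyte_per_s = mbit_per_s / 8.0;               /* 6,400 MByte/s */
    printf("%.0f Mbit/s = %.0f MByte/s\n", mbit_per_s, mbyte_per_s);
    return 0;
}
```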
Impact of Memory Bandwidth (Continued)
- Data paths are called buses; they transfer data between the computer's hardware components, including memory, input/output devices, and processors.
- Bus width is measured in bits, and system performance improves as bus capacity increases.
- Increasing the block size does not change the system's latency.
- The wider fetch can be viewed as a wide data bus (4 words, or 128 bits) attached to multiple memory banks.
- More practical systems transfer subsequent words in later bus cycles, after the first word is retrieved.
Impact of Memory Bandwidth (Continued).
- The examples above indicate that increased bandwidth results in higher peak computation rates.
- This assumes that consecutive data words in memory are used by successive instructions (spatial locality of reference).
- If a data-layout centric view is taken, computations must be reordered to promote spatial locality of reference.
Impact of Memory Bandwidth: Specific Example
- A code fragment sums the columns of matrix b into the vector column_sum.
Specific Example of Impact of Memory Bandwidth (Continued)
- The column_sum vector is small and easily fits in cache.
- Matrix b is accessed in column order.
- Poor performance may result from strided access.
Impact of Memory Bandwidth: Solution
- Traversing the matrix in row order instead yields much better performance (both loop orderings are sketched below).
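A minimal sketch of both loop orderings (the 1000 x 1000 dimensions are hypothetical; the notes do not give the exact code):

```c
#define N 1000   /* hypothetical matrix dimension */

/* Column-order traversal: C stores b row-major, so reading b[j][i] with the
   inner loop over j is strided access and makes poor use of each cache line. */
void column_sum_strided(double b[N][N], double column_sum[N]) {
    for (int i = 0; i < N; i++) {
        column_sum[i] = 0.0;
        for (int j = 0; j < N; j++)
            column_sum[i] += b[j][i];
    }
}

/* Row-order traversal: the inner loop walks consecutive memory locations,
   exploiting spatial locality, so performance is much better. */
void column_sum_rowwise(double b[N][N], double column_sum[N]) {
    for (int i = 0; i < N; i++)
        column_sum[i] = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            column_sum[i] += b[j][i];
}
```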
Memory System Performance: Summary
- Exploiting spatial and temporal locality in applications is key to amortizing memory latency and increasing effective memory bandwidth.
- The ratio of operations to memory accesses is a good indicator of how well a computation tolerates limited memory bandwidth.
- Memory layout and the organization of the computation significantly affect the spatial and temporal locality achieved.
Control Structure of Parallel Computing
- Processors with Hyper-Threading make each physical core behave like multiple logical cores; an operating system's device manager can then report, for example, 16 logical processors.
- Smartphones ship in octa-core, quad-core, and dual-core variations.
- Processing units in a parallel system may operate under a single centralized control unit, or each may work independently with its own control unit.
Control Structure of Parallel Computing (Continued)
- Parallelism can be expressed at various levels of granularity, from processes down to individual instructions.
- A range of models and corresponding architectural support exists between these extremes.
- SIMD: a single control unit dispatches the same instruction to multiple processing units, each operating on its own data stream.
- MIMD: each processing unit has its own control unit and can execute different instructions on different data items (see the sketch below).
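A small hypothetical illustration: the element-wise loop below is the canonical data-parallel (SIMD-style) pattern, where one instruction stream is applied to many data items; on a MIMD machine each processor would instead run its own instruction stream on its own data:

```c
/* SIMD-style pattern: the same operation is applied to every element.
   On a SIMD machine the iterations would execute in lockstep on separate
   processing elements under a single control unit. */
void vector_add(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```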
Definitions
- Supercomputer: The fastest computer in the world at the time.
- SIMD: Single instruction stream, multiple data stream
- MIMD: Multiple instruction stream, multiple data stream
- Multiple processors issue instructions asynchronously.
- Shared memory: a common memory for all processors in a MIMD computer.
- UMA: Uniform memory access by all processors.
- NUMA: Non-uniform memory access, depending on the memory's location relative to the processor.
- Distributed memory: memory partitioned into processor-private regions in a MIMD computer.
- Reduced instruction set computer: RISC.
- Complex functions are performed by simple and faster instructions.
- Processing in assembly-line style is called pipelining.
Computing Taxonomy
- In 1966, Flynn proposed a taxonomy of computer architectures based on instruction streams and data streams.
- SISD: Single Instruction stream / Single Data stream represents a conventional sequential computer.
- SIMD: Single Instruction stream / Multiple Data stream has one control unit delegating instructions to multiple processing units.
- MISD: Multiple Instruction stream / Single Data stream; some place pipelined architectures in this group, but there are essentially no real architectures in it.
- MIMD: Multiple Instruction stream / Multiple Data stream; each processing unit has its own control unit and can execute a different instruction stream.
MIMD Architectures
- The MIMD classification is refined by program structure and by address-space organization.
- Program structure: SPMD (Single Program, Multiple Data).
- Program structure: MPMD (Multiple Program, Multiple Data).
- Address-space organization:
- Distributed-memory architectures.
- Shared-memory architectures, including multiprocessors and multi-core systems.
SPMD
- Tasks are split across multiple processors and run simultaneously and independently on different data; SPMD is a general subcategory of MIMD.
- It represents a main paradigm within parallel programming.
- In SPMD, autonomous processing units simultaneously execute the same program on different data.
Message Passing Architectures
- Processes interact by exchanging messages, hence the name; messages are used to transfer data, to transfer work, and to synchronize the actions of the processes (a minimal sketch follows).
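As a concrete illustration of the message-passing/SPMD style, here is a minimal MPI program; MPI is not mentioned in the notes and is used here only as a common example of the model:

```c
#include <mpi.h>
#include <stdio.h>

/* Every process runs this same program (SPMD); rank 0 sends a value to
   rank 1, which receives and prints it -- the message transfers the data and
   implicitly synchronizes the two processes. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```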