Questions and Answers
What is a primary reason for parallelizing programs?
Parallel algorithms always guarantee better results in terms of speed.
False
Name one benefit of parallel computing.
Increased computational speed or bigger simulations.
The main focus of program parallelization is to do things ___ or ___ in less time.
Match the following concepts related to parallel computing:
Which loop in the code can be parallelized?
Data dependencies can enhance the speedup of parallelization.
What probably runs more efficiently on a sequential machine?
The presence of __________ between events limits parallelization strategies.
Match the following scenarios with their outcomes:
What does Gustafson’s law primarily focus on?
Loop-carried dependences can be parallelized effectively.
What is the equation for the scaled speedup defined by Gustafson's law?
The process of adapting an algorithm to allow for __________ may involve rewriting the algorithm.
Match the following terms with their definitions:
Which statement regarding the Jacobi iteration is true?
Thread-privatizing scalar variables can help increase parallelism in loops.
What should be removed from loops to increase parallelism?
What does Amdahl’s law state about the maximum speedup in parallel computing?
Poor temporal locality occurs when the same memory location is accessed frequently and repeatedly.
What is meant by 'memory wall' in the context of CPU and memory performance?
In optimization, reducing the fraction ____ is important for achieving higher performance.
Match the following concepts with their definitions:
Which of the following is NOT a limitation to optimal sequential execution?
Compiler optimizations should be used sparingly for best performance in sequential execution.
Name one technique that can be used to analyze the performance of code.
Which component is NOT typically part of a classic 5-stage instruction pipeline?
In uniprocessors, instruction scheduling is part of the compiler optimizations.
What are the two types of memory models mentioned in the context of multiprocessors?
The instruction pipeline increases the __________ bandwidth.
Match the following stages of the instruction pipeline with their functions:
What does a superscalar architecture allow?
Instruction pipelining allows for the execution of instructions only sequentially.
What role do caches and TLB play in data storage for uniprocessors?
In a 4-stage pipeline, instructions are processed in a specific __________.
Which of the following is NOT a characteristic of multiprocessors?
What does the find_gcf1 function primarily compute?
The find_gcf2 function uses the modulo operator to find the GCF.
What type of dependence is read-after-write classified as?
The find_gcf1 function will return ___ if 'a' is equal to 1.
Which type of dependence is characterized as 'false dependence'?
Match the following types of data dependences with their descriptions:
In the find_gcf2 function, if 'a' is less than 'b', 'b' will be subtracted from 'a'.
What happens when b equals zero in the find_gcf1 function?
Study Notes
Introduction
- Parallel programming aims to create faster and bigger programs.
- Benefits depend on speedup, efficiency, and scalability of algorithms.
- Limitations to parallel speedup exist.
- Future computing needs faster and more powerful solutions.
Books
- HPC: "Software Optimization for High Performance Computing: Creating Faster Applications" by K.R. Wadleigh and I.L. Crawford
- OPT: "The Software Optimization Cookbook" (2nd ed.) by R. Gerber, A. Bik, K. Smith, and X. Tian
- PP2: "Parallel Programming: Techniques and Applications using Networked Workstations and Parallel Computers" (2nd ed.) by B. Wilkinson and M. Allen
- PSC: "Parallel Scientific Computing in C++ and MPI" by G. Karniadakis and R. Kirby II
- SPC: "Scientific Parallel Computing" by L.R. Scott, T. Clark, and B. Bagheri
- SRC: "Sourcebook of Parallel Computing" by J. Dongarra, I. Foster, G. Fox, W. Gropp, K. Kennedy, L. Torczon, and A. White
Why Parallel?
- The benefit of using a parallel algorithm is not always obvious.
- Faster: doing the same work in less time.
- Bigger: doing more work in the same time.
Faster, Bigger!
- Demand for computational power increases for real-world problems
- Computations must be completed in a reasonable time.
- Real-time computation is needed.
Example: Weather Prediction
- Earth's atmosphere is divided into 5 × 10^8 cubes.
- 200 floating-point operations per cube are needed for one time step.
- 10^4 time steps are required for a 7-day forecast.
- 10^15 floating-point operations are needed in total.
- A 1 GFLOPS (10^9 flop/s) machine takes 10^6 seconds (~10 days) for these calculations.
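As a quick check of the arithmetic above:
- 5 × 10^8 cubes × 200 flop per cube × 10^4 time steps = 10^15 flop
- 10^15 flop ÷ 10^9 flop/s = 10^6 s, i.e. roughly 11.6 days (of the order of 10 days)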
Grand Challenge Problems
- "Grand Challenge" problems are ones that cannot be solved in a reasonable amount of time with today's computers.
- Examples include modeling complex environmental systems, biomedical imaging, molecular biology, computational sciences, and nuclear power/weapons simulations.
CPU Energy Consumption & Heat Dissipation
- Extrapolations of chip power density predicted that, by around 2010, CPUs would become as hot as the surface of the sun.
- Parallelism helps conserve energy.
Important Factors
- Physical limitations (speed of light etc.)
- Economic factors (cheaper, comparable component performance)
- Scalability (allows data to be divided over CPUs)
- Memory (increase bandwidth)
Why not Parallel?
- Bad parallel programs can be worse than their sequential counterparts: Slower due to communication overhead.
- Scalability: parallel execution is only faster for large problem sizes.
- Need to know the nature of a problem before determining whether it can be solved in parallel.
Collatz Conjecture
- The Collatz iteration is inherently sequential for any starting integer greater than 0: each step depends on the result of the previous one.
- The algorithm illustrates a computation whose steps cannot be executed in parallel (see the sketch below).
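A minimal C sketch of the Collatz iteration (the function name is illustrative); each step consumes the value produced by the previous step, so the computation for a single starting value is inherently sequential:

```c
#include <stdio.h>

/* Count the Collatz steps needed to reach 1 from n.
   Each iteration depends on the value produced by the previous one,
   so the loop is inherently sequential. */
static unsigned long collatz_steps(unsigned long n)
{
    unsigned long steps = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;  /* next value depends on current n */
        steps++;
    }
    return steps;
}

int main(void)
{
    printf("Collatz steps for 27: %lu\n", collatz_steps(27));
    return 0;
}
```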
Speedup
- Speedup (Sp) is the ratio of sequential execution time to parallel execution time.
- Perfect/ideal speedup: Sp = P
- Linear speedup: Sp ≈ P
- Superlinear speedup: Sp > P
Relative Speedup
- Relative speedup is the ratio of the execution time on one processor to the execution time on k processors.
- It is typically used when k is the smallest number of processors on which the given problem size can run.
Example: Parallel Search
- Partition the search space into sections that are searched in parallel.
- Best case for sequential execution: the speedup is 1.
- Worst case for sequential execution: the speedup approaches infinity as the execution time Δt tends to zero.
Effects that can Cause Superlinear Speedup
- Cache effects: when each processor's share of the data fits entirely in its data cache, the sharp reduction in cache misses can produce superlinear speedup.
- For an algorithm that would otherwise show linear speedup, this reduction in data cache misses can push the measured speedup above P.
Efficiency
- Efficiency (Ep) is the ratio of the speedup (Sp) to the number of processors (P).
- Ideal (or perfect) speedup means 100% efficiency (Ep = 1).
- Difficult-to-parallelize algorithms have efficiency that approaches zero as the number of processors increases.
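In the notation of these notes, the two measures can be written compactly as:
- Sp = ts / tp (sequential execution time divided by parallel execution time on P processors)
- Ep = Sp / P = ts / (P × tp)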
Scalability
- An algorithm is scalable if there is a minimum efficiency ε > 0 such that, for any problem size N, there is a number of processors P(N), tending to infinity as N tends to infinity, for which the efficiency satisfies EP(N) ≥ ε as N is made arbitrarily large.
- Speedup describes how the parallel algorithm's performance changes with increasing P
- Scalability concerns the efficiency of the algorithm with changing problem size N
Amdahl's Law
- Amdahl's Law limits the speedup achievable by adding more processors.
- This is because a fixed fraction of a program is inherently sequential and cannot be parallelized, leading to a limit on achievable speedup no matter how many processors are used.
- Let f be the fraction of the computation that is sequential. Total execution time on P processors is given by the formula: tp = (1-f)ts/P + f*ts
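Dividing ts by this expression gives the usual statement of the law (a standard rearrangement, not quoted from the slides):
- Sp = ts / tp = P / (1 + (P - 1) f)
- As P grows, Sp approaches the upper bound 1/f.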
Gustafson's Law
- Gustafson's Law defines scaled speedup by considering the parallel time to be constant.
- This means that as the problem size increases, the number of processors can also increase to maintain the same parallel execution time.
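In its standard form, with f again the fraction of the (fixed) parallel execution time that is sequential, the scaled speedup is:
- Scaled speedup = f + P (1 - f) = P - (P - 1) f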
Limitations to Speedup: Data Dependences
- Data dependences between events limit the available parallelization strategies.
- Loops with loop-carried dependences cannot be parallelized directly.
- Loops must be changed to remove dependences in order to increase parallelism.
- Rewriting the algorithm may be needed to expose additional parallelism (see the sketch below).
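A small C sketch (array names and bounds are illustrative): the first loop carries a dependence between iterations and cannot be parallelized as written, while the second has fully independent iterations:

```c
#define N 1000

void examples(double a[N], double b[N], double c[N])
{
    /* Loop-carried dependence: a[i] needs a[i-1] from the previous
       iteration, so iterations cannot run in parallel as written. */
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];

    /* No loop-carried dependence: every iteration is independent,
       so this loop can be parallelized. */
    for (int i = 0; i < N; i++)
        c[i] = 2.0 * b[i];
}
```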
Memory Locality
- Memory hierarchy forms a barrier to performance when locality is poor.
- Temporal locality: the same memory location is accessed repeatedly within a short period of time.
- Spatial locality: memory locations close to each other (e.g., consecutive array elements) are accessed together.
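A short C sketch contrasting two traversal orders of a matrix (the array size is illustrative); the row-major loop has good spatial locality, the column-major loop does not:

```c
#define N 1024

double sum_row_major(double a[N][N])
{
    double s = 0.0;
    /* Good spatial locality: consecutive elements of each row are
       adjacent in memory and share cache lines. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

double sum_col_major(double a[N][N])
{
    double s = 0.0;
    /* Poor spatial locality: each access strides N*sizeof(double) bytes,
       so almost every access touches a different cache line. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```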
Efficient Sequential Execution
- Memory effects limit optimal sequential execution.
- Store/Load dependences, where data must move through memory
- Cache misses, TLB misses, and page faults
- CPU resource effects including limited floating-point unit count.
- Compiler optimizations are important.
- Performance analyzers help analyze performance and investigate the causes of overhead.
Uniprocessors, overview
- Uniprocessor details: processor architecture, instruction set architecture, instruction scheduling and execution, and data storage.
- Compiler optimizations and data storage: memory hierarchy, caches, TLB, etc.
Multiprocessors, overview
- Multiprocessor details: pipeline and vector machines, shared memory, distributed memory, message passing, and parallel programming models (shared vs. distributed memory, hybrid, BSP).
Levels of Parallelism
- Instruction-level parallelism (ILP)
- Hyper-threading
- Multiprocessing
- Distributed processing
Instruction Pipeline
- The instruction pipeline increases the instruction bandwidth.
- Classic 5-stage pipeline:
- IF - Instruction Fetch
- ID - Instruction Decode
- EX - Execute
- MEM - Memory access
- WB - Write Back
- Up to 5 instructions are in the pipeline at the same time, each in a different stage.
Instruction Pipeline Hazards
- Data dependences
- Instruction fetch latencies
- Memory load latencies.
- A hazard is resolved by stalling the pipeline, which introduces a bubble of one or more cycles.
N-way Superscalar
- When instructions have the same word size, multiple instructions can be fetched.
- N-way superscalar processors fetch N instructions per cycle.
- This increases instruction-level parallelism.
Instruction Latency Case Study
- Two versions of Euclid's algorithm
- Modulo version
- Repetitive subtraction version
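A hedged C sketch of the two variants; which of these corresponds to the slides' find_gcf1 and find_gcf2 is not specified here, so the names below are illustrative:

```c
/* Modulo version: few iterations, but the integer divide used by the
   modulo operator has a long instruction latency. */
unsigned gcd_mod(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned r = a % b;
        a = b;
        b = r;
    }
    return a;
}

/* Repeated-subtraction version: many more iterations, but each one
   uses only cheap compare and subtract instructions. */
unsigned gcd_sub(unsigned a, unsigned b)
{
    if (a == 0) return b;
    if (b == 0) return a;
    while (a != b) {
        if (a > b) a -= b;
        else       b -= a;
    }
    return a;
}
```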
Data Dependences
- Instruction-level parallelism is limited by data dependences.
- Types of data dependences:
- RAW (read after write): true or flow dependence
- WAR (write after read): anti dependence, a false dependence
- WAW (write after write): output dependence, a false dependence
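A tiny C illustration of the three kinds of dependence (variable names are illustrative):

```c
void dependences(int a, int b)
{
    int x, y;
    x = a + b;   /* writes x; reads a and b                                 */
    y = x + 1;   /* RAW (true/flow): reads the x written just above         */
    a = 5;       /* WAR (anti): writes a after the earlier read of a        */
    x = b * 2;   /* WAW (output): writes x again after the first write      */
    (void)x;
    (void)y;
}
```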
Case Study (1)
- An example loop illustrating flow dependences and the associated memory loads and stores.
Instruction Scheduling
- Instruction scheduling can hide data dependence latencies.
- Static scheduling: compiler moves independent instructions to fixed positions.
- Dynamic scheduling: CPU executes instructions out-of-order, when ready
Hardware Register Renaming
- Register renaming is performed by the processor to remove unnecessary WAR dependences caused by register reuse; such reuse arises because the compiler has only a fixed number of architectural registers to allocate.
Data Speculation
- A load should be initiated as far in advance as possible to hide memory latency.
- If the compiler cannot prove that a store address A1 and a later load address A2 differ, it must assume a RAW dependence.
- Advanced load instructions let the processor ignore a potential RAW dependence and resolve any conflict at run time; if the store and the load do access the same location, it is a genuine RAW dependence and must be honored.
Control Speculation
- Control speculation allows conditional instructions to be executed before the conditional branch in which the instruction occurs.
- It hides memory latencies and enables loading early.
Hardware Branch Prediction
- Branch prediction is an architectural feature that enables the processor to fetch instructions of the target branch.
- No branch penalty if predicted correctly.
- A longer penalty if incorrectly predicted.
- Branch prediction typically uses a history mechanism.
Performance of Storage
- Access speeds vary from registers (fast) to cache (tens of cycles) to memory (hundreds of cycles) to disk storage (slowest).
- Each level is backed by the next slower level in the hierarchy.
- This allows the fastest access method to be used whenever possible.
Memory Access
- Logical address is translated into a physical address in virtual memory.
- TLB is a translation lookaside buffer for efficient on-chip address translation.
- Memory is divided into pages. If a page is missing, a page fault occurs.
- Page fault: page is fetched from the disk.
Translation Lookaside Buffer (TLB)
- A hardware cache to speed up address translation.
- Reduces the need to access the page table in physical memory, improving performance.
Caches
- Direct-mapped caches: each location in main memory can be cached only by one cache location.
- N-way associative caches: each location in main memory can be cached by one of N cache locations.
- Fully associative caches: each location in main memory can be cached by any cache location.
Cache Details
- Cache lines can be between 8-512 bytes.
- Longer lines increase the usable memory bandwidth but also increase wasted space.
- For example, an 8-way L1 cache with 64 rows of 64-byte lines, and an 8-way L2 cache with 1024 rows of 128-byte lines.
Cache Misses
- Compulsory misses: First access to a datum
- Capacity misses: When cache is full
- Conflict misses: Mapping scheme conflicts
False Sharing
- False sharing occurs when multiple caches of processors (or cores) cache different data items but are on the same cache line.
- Cache coherency forces the reload of a cache line to avoid data inconsistency.
- Allocate non-shared data on different cache lines to avoid false sharing.
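A minimal C sketch of the layout issue, assuming 64-byte cache lines (structure and field names are illustrative): two per-thread counters declared next to each other can land on the same cache line, while padding gives each its own line.

```c
#include <stdint.h>

#define CACHE_LINE 64  /* assumed cache-line size in bytes */

/* Prone to false sharing: both counters typically share one cache line,
   so two threads incrementing "their own" counter still invalidate
   each other's cached copy of that line. */
struct counters_bad {
    uint64_t thread0_count;
    uint64_t thread1_count;
};

/* Padded so each counter occupies its own cache line. */
struct counters_good {
    uint64_t thread0_count;
    char     pad0[CACHE_LINE - sizeof(uint64_t)];
    uint64_t thread1_count;
    char     pad1[CACHE_LINE - sizeof(uint64_t)];
};
```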
Multiprocessors
- Brief overview of multiprocessors
PMS Architecture Model
- Processor (P): A device that performs operations on data.
- Memory (M): A device that stores data.
- Network/Switch(S): A device that facilitates transferring data between devices.
Shared Memory Multiprocessor
- Processors share a common switch (e.g., a bus) for memory access.
- Shared memory has a single address space.
- Bus contention occurs when several processors try to access memory simultaneously.
- A simple/single bus results in a bottleneck.
Shared Memory: The Bus Contention Problem
- Each processor competes for access to shared memory (for instructions and data).
- Memory access is restricted to a single processor at a time.
- This limits speedup and scalability with respect to the number of processors.
Shared Memory: Work-Memory Ratio
- Work-memory ratio is the ratio of floating-point operations to the number of distinct memory locations referenced in the innermost loop.
Shared Memory with Local Cache
- Adding local caches improves performance when the work-memory ratio is smaller than 1.
- The problem is ensuring cache coherence.
Shared Memory: Cache Coherence
- Cache coherence protocols ensure that processors obtain the most up-to-date data from shared memory when it's modified by another processor.
- Caches operate on cache lines, and more data than the shared object can be affected, leading to false sharing.
COMA (Cache-only Memory Architecture)
- A technique that stores frequently accessed data in caches, thereby eliminating the need to access shared memory.
Distributed Memory Multicomputer
- Large, massively parallel processing (MPP) systems with over 1000 processors.
- Communication takes place via message passing.
- This differs significantly from a shared memory architecture; instead of a common memory space, each processor has its own local memory.
Distributed Shared Memory (DSM)
- A distributed memory system with a global address space that gives an illusion of shared memory to processes.
- Hardware translates memory addresses into local addresses.
- Software approaches add a programming layer to simplify access to shared objects.
Computation-Communication Ratio
- The computation-communication ratio relates the time spent computing to the time spent communicating between processors.
- As the number of processors increases, communication time grows while per-processor computation time decreases, which imposes a limit on the theoretical maximum speedup.
Mesh Topology
- A network topology where processors are connected in a grid pattern, forming a 2d network.
- Diameter is calculated as 2 × (√P − 1), where P is the number of processors.
- A torus is a special case of a mesh, where the ends are connected.
Hypercube Topology
- A network topology where processors are connected in a hypercube (d-dimensional), providing more connections.
- The diameter is given by d = log₂P, where P is the number of processors.
Cross-Bar Switches
- A cross-bar switch allows simultaneous (contention-free) communication between any given processor and any given memory module via the switches.
Performance Analysis
- Overview of different methods to measure performance.
What to Measure?
- Concepts for measuring performance include:
- Elapsed execution time
- Number of instructions executed
- Floating-point operations per second
- System events (e.g., cache hits)
- Identifying and determining overhead
- Measuring communication latency
- Measuring parallel speedup
Measuring Elapsed Time: Real, User, and System Time
- Real time (wall clock time) is the total time elapsed from start to end of a task.
- CPU user time = time spent executing in user space, excluding system calls.
- CPU system time = time spent executing system calls and processes in kernel space.
Measuring Elapsed Time: System Timers (libc)
- Overview of system APIs for obtaining time information (times(), getrusage(), gettimeofday(), gethrtime(), GetSystemTime()).
- MPI_Wtime() returns elapsed wall-clock time in seconds (as a double-precision value).
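A small C sketch using two of the POSIX interfaces listed above, gettimeofday() for real time and getrusage() for CPU user/system time (the busy loop is only a placeholder workload):

```c
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void)
{
    struct timeval t0, t1;
    struct rusage ru;

    gettimeofday(&t0, NULL);            /* start of wall-clock interval */

    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++)   /* placeholder workload */
        x += 1.0e-9;

    gettimeofday(&t1, NULL);            /* end of wall-clock interval */
    getrusage(RUSAGE_SELF, &ru);        /* accumulated user/system CPU time */

    double real = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;

    printf("real %.3f s, user %.3f s, sys %.3f s\n", real, user, sys);
    return 0;
}
```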
Measuring Elapsed Time: Hardware Timers (RDTSC)
- RDTSC instruction for accurate CPU cycle counting.
- Multicore systems require synchronization for accurate results.
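A minimal sketch of cycle counting with RDTSC, assuming GCC or Clang on x86 (the __rdtsc() intrinsic from <x86intrin.h>); the workload is a placeholder:

```c
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang, x86 only */

int main(void)
{
    unsigned long long start = __rdtsc();

    volatile double x = 0.0;
    for (int i = 0; i < 1000000; i++)    /* placeholder workload */
        x += 1.0;

    unsigned long long end = __rdtsc();
    printf("elapsed cycles (approx): %llu\n", end - start);
    return 0;
}
```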
Measuring Elapsed Time: Benchmarking
- Methods for estimating MFLOPs, handling inaccurate timer resolutions, running multiple iterations of an algorithm.
Profiling
- A detailed overview of instrumentation-based and sampling-based methods.
Instrumentation-Based Profiling
- Detailed explanation of the concept, instrumentation code and use of counters to record events during program execution.
- Description of "Heisenbugs" that may occur due to changes in timing during profiling.
Sampling-Based Profilers
- Detailed explanation of sampling-based profiling techniques.
- Description of call graph and flat profiling output formats and methods.
- Statistical inaccuracies in sampling-based profiling are explained, along with methods to improve them by using multiple sampling and/or combining outputs.
Hardware Counter Profiling
- Using special CPU registers to count CPU events.
- Only a limited number of counters is available; configuration examples are given for the UltraSPARC IIIi.
- Counter overflow handling using interrupts to prevent data loss.
HW Counter Profiling with Sun Studio Profiler
- Using Sun Studio profiler to obtain a list of available hardware counters.
- Examples provided including counting instructions, FP operations, and cache misses.
Iterative Optimization: The Profile + Improve Cycle
- Iterative optimization strategy: profiling, identifying hotspots, optimizing, repeating to improve overall application performance
- Focus on finding hotspots by profiling and optimizing them to reduce execution time
Finding Hotspots
- Finding critical components, especially those often executed, causing slow execution time.
- Finding data-intensive portions that cause high cache miss rates and slow down execution.
Profile-Guided Optimization (PGO)
- PGO helps improve I-cache, paging, and branch prediction.
- Better basic-block ordering and register allocation
- Better decisions about which functions to inline and where to place loops
- Switch-statement optimization and more aggressive loop optimization
CPU Bound versus I/O Bound Processes
- A detailed breakdown/comparison of CPU-bound and I/O-bound processes.
Measuring Message Passing Performance
- Description of message-passing strategies using ping-pong measurements.
- Different network topologies impact message latency and bandwidth.
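A minimal ping-pong sketch in C with MPI (message size, repetition count, and variable names are illustrative), of the kind used to estimate round-trip latency between two ranks; run it with two MPI processes:

```c
#include <mpi.h>
#include <stdio.h>

#define NBYTES 1024   /* illustrative message size */
#define REPS   1000   /* illustrative repetition count */

int main(int argc, char **argv)
{
    int rank, size;
    char buf[NBYTES] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {               /* ping-pong needs exactly two participants */
        MPI_Finalize();
        return 1;
    }

    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* average round-trip time per message pair */
        printf("avg round trip: %g s\n", (t1 - t0) / REPS);

    MPI_Finalize();
    return 0;
}
```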
Time Complexity of a Parallel Program
- Parallel computation time and parallel communication time.
- Ideal speedup
- The ratio of communication time to computation time determines how much of the ideal speedup is achieved.
Performance Prediction Graph
- A log-log graph of computation time and communication time versus the number of processors is used to predict performance, based on how each grows with the processor count.
- The critical number of processors is the point at which communication time overtakes computation time; it is very large when communication time is low.
Overlapping Computation with Communication
- Methods for overlapping computation with communication to improve performance.
Parallel Programming Models
- Overview of different parallel program models for programming applications, including multiprogramming, shared address space model, data parallel, message passing, hybrid models, BSP models.
Programming Model 1: Shared Address Space
- Task parallelism (with thread-based MIMD)
- Data in a shared address space
- Global variables and heap memory
- Thread state is kept on each thread's runtime stack; threads synchronize explicitly through operations on shared variables, such as thread creation/join, reading and writing flags, locks, and semaphores.
Programming Model 1: Uniform Memory Access (UMA)
- Description of uniform memory access shared memory architectures (UMA).
- Memory access time is uniform for all memory locations, regardless of where they are physically located.
- Example diagram provided.
Programming Model 1: Nonuniform Memory Access (NUMA)
- Description of nonuniform memory access shared memory architectures (NUMA).
- Memory access times depend on the location of the data relative to the processor.
- Access time to local memory is faster than to remote memory.
- Example diagram provided.
Programming Model 1: Distributed Shared Memory (DSM)
- Description of distributed shared memory architectures (DSM).
- A distributed memory system with a logically shared address space.
- Data access takes place using message passing.
Programming Model 1: Example Code
- Example code covers thread creation, shared-memory issues, and race conditions (see the sketch below).
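A minimal pthreads sketch of a race condition and its fix (the counter and iteration count are illustrative); without the mutex, concurrent increments of the shared counter can be lost:

```c
#include <pthread.h>
#include <stdio.h>

#define INCREMENTS 1000000

static long counter = 0;                         /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&lock);               /* remove the lock/unlock pair    */
        counter++;                               /* and updates may be lost (race) */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected %d)\n", counter, 2 * INCREMENTS);
    return 0;
}
```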
Programming Model 2: Data Parallel
- Data parallel programming using a single thread of control.
Programming Model 3: Message Passing
- Description of message passing programming models.
- Each process has its own thread of control, local memory, and address space
- Communicates via explicit data transfers on message-passing systems
- Messages sent by a source to a destination processor (or compute node)
- Logically shared data explicitly partitioned across local memories
Programming Model 3: Message Passing Example Code
- Examples of message passing usage between processors.
Programming Model 4: Hybrid Systems
- Hybrid systems combine shared memory within SMP clusters and message passing outside, providing performance benefits.
- Combining UMA and NUMA intelligently is possible.
Programming Model 5: Bulk Synchronous Processing (BSP)
- A parallel program is composed of supersteps
- The compute phase, communication phase and barrier synchronization phases form a superstep in the BSP model.
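A schematic superstep, sketched in C with MPI standing in for the communication layer (the routine and the use of MPI_Allreduce are illustrative, not taken from the slides):

```c
#include <mpi.h>

/* One BSP superstep: local computation, then communication,
   then barrier synchronization. */
void superstep(double *local, int n, MPI_Comm comm)
{
    /* 1. Compute phase: operate only on local data. */
    for (int i = 0; i < n; i++)
        local[i] *= 2.0;

    /* 2. Communication phase: exchange data (here, a global sum of the
          first element as a stand-in for real communication). */
    double sum;
    MPI_Allreduce(&local[0], &sum, 1, MPI_DOUBLE, MPI_SUM, comm);
    local[0] = sum;

    /* 3. Barrier synchronization ends the superstep. */
    MPI_Barrier(comm);
}
```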
Programming with Shared Memory
- Brief overview of shared memory concept and programming concepts.
Description
Test your knowledge on parallel computing and its underlying principles. This quiz covers key concepts such as Gustafson’s law, speedup equations, and various programming strategies. Challenge yourself with matching terms, identifying parallelizable loops, and understanding data dependencies.