Slide #1.

Chapter ∞: Parallel Algorithms • PCAM • SIMD & MIMD • Designing Parallel Programs
CS 340, Page 1


Slide #2.

PCAM

When designing algorithms for systems of parallel processors, two primary problems must be addressed:
• Avoid processor idle time by balancing the processing load between all of the processors.
• Avoid time-consuming communication between processors by allocating interdependent tasks to the same processor.

One common approach: PCAM
• Partition: Split the problem into a large number of small tasks to expose opportunities for parallel execution.
• Communicate: Identify the data flow that is required between separate tasks.
• Agglomerate: Combine tasks to minimize communication and balance loads; evaluate the pros and cons of data replication.
• Map: Assign tasks to processors in a manner that improves concurrency while maintaining locality.

[diagram: Problem → Partition → Communicate → Agglomerate → Map]


Slide #3.

PCAM Example: VLSI Floorplan

Assume that you are designing a microchip with several layout restrictions:
• The northwest corner of the chip must contain a 5×5 component of type A.
• Right next to the type-A component on the north side of the chip, a 24-square-unit component of type B is needed.
• Adjacent to the type-B component's east or south side, a 40-square-unit component of type C is needed.

The idea is to minimize the overall area of the chip.

[diagram: several possible floorplans, among many others]


Slide #4.

VLSI Floorplan: Algorithm Outline

The common sequential approach for this problem is a tree-based, depth-first technique called branch-and-bound, in which subtrees are pruned whenever it becomes clear that they would become too costly.

[search tree diagram:
Component A placement — chip size: 25
Component B placement, six possibilities — chip sizes: 54, 55, 64, 85, 65, 84
Component C placement (B1 version), twelve possibilities — chip sizes: 114, 144, 200, 174, 143, 140, 112, 130, 220, 150, 234, 102]
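The pruning idea above can be sketched in a few lines of Python. The tree below is a hypothetical subset of the slide's example (each node is a pair of partial chip size and child placements), not the full twelve-way expansion:

```python
# A minimal sketch of sequential branch-and-bound over a placement tree.
# A subtree is pruned as soon as its partial chip size already meets or
# exceeds the best complete floorplan found so far.

def branch_and_bound(node, best=float("inf")):
    size, children = node
    if size >= best:          # bound: this subtree cannot beat the incumbent
        return best
    if not children:          # leaf: a complete floorplan
        return min(best, size)
    for child in children:    # branch: explore placements depth-first
        best = branch_and_bound(child, best)
    return best

# Hypothetical tree loosely modeled on the slide: place A (size 25), then a
# few B options, then a few C options under each (only a subset shown).
tree = (25, [
    (54, [(114, []), (140, []), (102, [])]),
    (55, [(112, []), (130, [])]),
    (64, [(150, [])]),
])

print(branch_and_bound(tree))   # → 102
```

Once the 102-unit floorplan is found, every later subtree whose partial size is already 102 or more (e.g., the 112, 130, and 150 leaves) is cut off without exploration.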


Slide #5.

Partitioning

There is no obvious data structure that could be used to perform a decomposition of this problem's domain into components that could be mapped to separate processors. A fine-grained functional decomposition is therefore needed, where the exploration of each search tree node is handled by a separate task.

This means that new tasks will be created in a wavefront as the search progresses down the search tree, which will be explored in a breadth-first fashion. Notice that only tasks on the wavefront will be able to execute concurrently.

[diagram: the search tree from the previous slide, with the wavefront of concurrently executing tasks highlighted]


Slide #6.

Communication

In a parallel implementation of simple search, tasks can execute independently and need communicate only to report solutions. The parallel algorithm for this problem, however, will also need to keep track of the bounding value (i.e., the smallest chip area found so far), which must be accessed by every task.

One possibility would be to encapsulate the bounding value maintenance in a single centralized task with which the other tasks communicate. This approach is inherently unscalable, since the processor handling the centralized task can only service requests from the other tasks at a particular rate, thus bounding the number of tasks that can execute concurrently.

[diagram: the search tree from the previous slide]
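The centralized-bound scheme can be sketched with threads standing in for tasks. The class and method names here are hypothetical, and the lock around every access is exactly the serialization point that makes the design unscalable:

```python
# A sketch of centralized bounding-value maintenance: every worker task
# reports candidate chip sizes to, and queries, one shared keeper object.

import threading

class BoundKeeper:
    """Centralized keeper of the smallest chip area found so far."""
    def __init__(self):
        self.best = float("inf")
        self._lock = threading.Lock()

    def report(self, size):
        # All tasks funnel through this single point -- the bottleneck
        # the slide warns about.
        with self._lock:
            if size < self.best:
                self.best = size

    def worth_exploring(self, partial_size):
        with self._lock:
            return partial_size < self.best

keeper = BoundKeeper()

def worker(sizes):                      # hypothetical per-task workload
    for s in sizes:
        if keeper.worth_exploring(s):
            keeper.report(s)

threads = [threading.Thread(target=worker, args=(chunk,))
           for chunk in ([114, 140, 102], [112, 130], [150])]
for t in threads: t.start()
for t in threads: t.join()
print(keeper.best)   # → 102
```

With many more workers, the rate at which the keeper's lock can be acquired caps how many tasks make progress, which is the scalability limit described above.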


Slide #7.

Agglomeration

Practically speaking, parallelism will have to be constrained by the number of processors being employed on the program. For this problem, an obvious agglomeration would be to split the breadth-first search into tasks at each node until a certain level of the tree is reached, at which point each subtree can be handled as a sequential depth-first search by its assigned processor.

Of course, this step is more significant if the number of components being placed on the chip, and the number of options for placing them, is substantially increased.

[diagram: the A and B placement levels remain separate tasks; the B1–B6 subtrees are each handled sequentially]
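A sketch of this cutoff scheme, assuming the same hypothetical (size, children) tree representation as before: expand breadth-first to a chosen level, then hand each frontier subtree to a sequential depth-first routine.

```python
# Agglomeration sketch: breadth-first expansion down to a cutoff level
# produces the coarse tasks; each task is then a sequential DFS.

def frontier_at_level(root, level):
    """Breadth-first expansion: returns the subtrees that become tasks."""
    frontier = [root]
    for _ in range(level):
        nxt = []
        for size, children in frontier:
            nxt.extend(children if children else [(size, [])])
        frontier = nxt
    return frontier

def dfs_min(node):
    """Sequential depth-first search within one agglomerated subtree."""
    size, children = node
    return size if not children else min(dfs_min(c) for c in children)

tree = (25, [
    (54, [(114, []), (140, []), (102, [])]),
    (55, [(112, []), (130, [])]),
    (64, [(150, [])]),
])

# Level 1: one task per B placement, like the slide's B1..B6 subtrees.
tasks = frontier_at_level(tree, 1)
print(len(tasks), min(dfs_min(t) for t in tasks))   # → 3 102
```

Raising the cutoff level creates more, smaller tasks; the right level trades task-creation overhead against load balance across the available processors.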


Slide #8.

Mapping

One priority when mapping tasks to processors is determining how to avoid lengthy idle periods for the processors. For this problem, one processor can be assigned a supervisory role by handling the root node and generating tasks associated with the subtrees. The other processors maintain a queue of assigned tasks and, whenever their queues are depleted, they request tasks from other processors, which facilitates load balancing.

The agglomeration step ensured that the tasks are handled in a depth-first manner, which facilitates the pruning of unnecessary tasks. In order to keep the pruning rate as high as possible, processors keep tasks that are far from the root (i.e., more likely to cause pruning soon) and hand tasks that are near the root to idle processors.

[diagram: the B1–B6 subtrees distributed as tasks across the processors]
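A simplified sketch of the mapping idea: here a single shared queue stands in for the slide's per-processor queues with task requests between them, but the effect is the same in that a worker pulls a new subtree task the moment it goes idle. The subtree "tasks" are just placeholder numbers:

```python
# Mapping sketch: a supervisor seeds a queue with subtree tasks, and
# workers pull the next task whenever idle, balancing the load.

import queue
import threading

tasks = queue.Queue()
for subtree in [114, 102, 112, 130, 150, 140]:   # hypothetical B-subtree results
    tasks.put(subtree)

results = []
results_lock = threading.Lock()

def worker():
    while True:
        try:
            t = tasks.get_nowait()   # an idle worker requests the next task
        except queue.Empty:
            return                   # no work left anywhere
        with results_lock:
            results.append(t)        # stand-in for searching the subtree

threads = [threading.Thread(target=worker) for _ in range(3)]
for th in threads: th.start()
for th in threads: th.join()
print(min(results))   # → 102
```

Because workers take tasks on demand rather than by a fixed assignment, a processor that finishes its subtrees early is never left idle while others still have queued work.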


Slide #9.

SIMD

Parallel processing on SIMD (Single Instruction, Multiple Data) machines is accomplished by having all processors perform the same instruction on different data in a vectorized manner. For applications that require huge blocks of data upon which the same set of operations must be performed, both the data retrieval and the data processing can experience significant performance benefits.

[diagram: Processors 1–4 execute the same instruction stream in lockstep — retrieve data block A[ ] (A[0…99], A[100…199], A[200…299], A[300…399] respectively), retrieve data block B[ ] (B[0…99], B[100…199], B[200…299], B[300…399] respectively), then perform the instructions A[0] = A[0] * B[0] through A[19] = A[3] * B[6] on their own blocks]
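The lockstep idea can be modeled in plain Python. Real SIMD hardware would apply the multiply across all lanes in a single vectorized step; the outer loop below only models the four processors' identical instruction stream over their own 100-element blocks:

```python
# SIMD model: four "processors" each own a 100-element block and apply the
# identical instruction, A[i] = A[i] * B[i], to their own block.

A = list(range(400))     # A[0..399]
B = [2] * 400            # B[0..399]

blocks = [(p * 100, (p + 1) * 100) for p in range(4)]   # per-processor blocks

for lo, hi in blocks:            # same instruction stream on every block
    for i in range(lo, hi):
        A[i] = A[i] * B[i]

print(A[0], A[399])   # → 0 798
```

The performance benefit on real hardware comes precisely from there being only one instruction fetch and decode for all lanes, with the data paths operating in parallel.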


Slide #10.

SIMD Inefficiency

Parallelism in a SIMD machine does not necessarily eliminate all inefficiency and waste. For example, consider what happens when multiple processors attempt to execute the conditional at right with their respective data:

    if (u < v)
        z = x * y;
    else
        z = x + y;

Because the processors handle the instructions in lockstep, both condition cases are executed, regardless of the condition's evaluation.

[diagram: all four processors multiply x and y, compare u to v, and add x and y; Processor 1 (u1 = 10 < v1 = 20) stores the product in z1, Processor 2 (u2 = 16 ≥ v2 = 12) stores the sum in z2, Processor 3 (u3 = 14 < v3 = 21) stores the product in z3, and Processor 4 (u4 = 15 ≥ v4 = 13) stores the sum in z4]
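This "execute both branches, keep one result" behavior can be modeled with a per-lane mask. The u and v values below match the slide's example; the x and y values are hypothetical stand-ins:

```python
# SIMD branch model: every lane computes BOTH the product and the sum,
# and a mask from the comparison selects which result each lane stores.

x = [3, 4, 5, 6]        # hypothetical per-lane data
y = [7, 2, 9, 8]
u = [10, 16, 14, 15]    # comparison data from the slide
v = [20, 12, 21, 13]

products = [xi * yi for xi, yi in zip(x, y)]   # all lanes multiply ...
sums     = [xi + yi for xi, yi in zip(x, y)]   # ... and all lanes add
mask     = [ui < vi for ui, vi in zip(u, v)]   # per-lane test u < v

# Each lane stores the result its mask selects; the other is wasted work.
z = [p if m else s for p, s, m in zip(products, sums, mask)]

print(z)   # → [21, 6, 45, 14]
```

Lanes 1 and 3 (where u < v) keep the product while lanes 2 and 4 keep the sum, yet all four lanes spent cycles on both the multiply and the add, which is the inefficiency the slide illustrates.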


Slide #11.

SIMD Applications

Note that divide-and-conquer algorithms for sequential machines tend to have linear speedups on SIMD machines. Mergesort, for example, takes O((n/p) log(n/p) + n) time on a p-processor MIMD machine, which translates to O(n) time on an n-processor SIMD machine.

The limitations of the SIMD approach (i.e., large data sets to which uniform instructions are applied in lockstep fashion) have restricted its applicability to such areas as graphics (e.g., animation, games) and video processing (e.g., image enhancement, video compression, format conversion).


Slide #12.

MIMD

Parallel processing on MIMD (Multiple Instruction, Multiple Data) machines is accomplished by having all processors work independently, performing different instructions on different data.

Applications for this type of system involve tasks that can be easily separated from each other, such as CAD/CAM, modeling and simulation, and network processing, thus minimizing the time-consuming communication between processors. Most current supercomputers, networked parallel computer clusters and “grids”, and multi-core PCs use the MIMD approach.

[diagram: Processor 1 executes z = x * y; alpha(z, n); Processor 2 executes for (i = 1; i < n; i++) { A[i] = 0.0; B[i] = 14.8; } Processor 3 executes z = x * y; alpha(z, n); Processor 4 executes w = u * v; t = 2 * w + v / 10;]
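The contrast with SIMD can be modeled with threads, each running a different function, loosely mirroring the slide's code fragments (the input values are hypothetical):

```python
# MIMD model: independent workers execute different instruction streams
# on different data, modeled here as threads running distinct functions.

import threading

results = {}

def proc1():                       # mirrors: z = x * y;
    results["z1"] = 3 * 4

def proc2():                       # mirrors: the A[i]/B[i] init loop
    results["A"] = [0.0] * 5
    results["B"] = [14.8] * 5

def proc3():                       # mirrors: w = u * v; t = 2 * w + v / 10;
    w = 2 * 5
    results["t"] = 2 * w + 5 / 10

threads = [threading.Thread(target=f) for f in (proc1, proc2, proc3)]
for t in threads: t.start()
for t in threads: t.join()
print(results["z1"], results["t"])   # → 12 20.5
```

Because the three workers never read each other's data, they need no communication at all until the final join, which is what makes such easily separated tasks a good fit for MIMD systems.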


Slide #13.

Designing Parallel Programs

Designing and developing parallel programs has characteristically been a very manual process, with the programmer typically responsible for both identifying and actually implementing parallelism. Very often, manually developing parallel code is a time-consuming, complex, error-prone, and iterative process.

The first step is to determine whether or not a problem is one that can actually be parallelized.

Example of a Parallelizable Problem: Calculate the potential energy for each of several thousand independent conformations of a molecule; when done, find the minimum-energy conformation. This problem can be solved in parallel: each of the molecular conformations is independently determinable, and the calculation of the minimum-energy conformation is also a parallelizable problem.

Example of a Non-Parallelizable Problem: Calculation of the Fibonacci series (1, 1, 2, 3, 5, 8, 13, 21, ...) by use of the formula F(n) = F(n-1) + F(n-2). This is a non-parallelizable problem because the calculation of the Fibonacci sequence as shown would entail dependent calculations rather than independent ones: the calculation of the F(n) value uses those of both F(n-1) and F(n-2), so these terms cannot be calculated independently.
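The parallelizable example can be sketched directly: independent per-conformation evaluations followed by one reduction. The energy function below is a stand-in, not a real molecular model, and a thread pool stands in for whatever parallel machinery a real code would use:

```python
# Sketch of the parallelizable problem: evaluate each conformation
# independently, then reduce to find the minimum-energy one.

from concurrent.futures import ThreadPoolExecutor

def potential_energy(conformation):
    # Hypothetical placeholder for an independent per-conformation model.
    return sum((c - 1.5) ** 2 for c in conformation)

conformations = [[0.0, 1.0], [1.0, 2.0], [1.5, 1.5], [3.0, 0.5]]

# Each evaluation is independent, so all can run concurrently ...
with ThreadPoolExecutor(max_workers=4) as pool:
    energies = list(pool.map(potential_energy, conformations))

# ... and the minimum is found in a final (also parallelizable) reduction.
best = min(range(len(energies)), key=energies.__getitem__)
print(best, energies[best])   # → 2 0.0
```

The Fibonacci formula admits no such decomposition: F(n) cannot start until F(n-1) and F(n-2) have finished, so the computation is an inherently sequential chain.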


Slide #14.

Problem Decomposition

One of the first steps in designing a parallel program is to break the problem into discrete “chunks” of work that can be distributed to multiple tasks.

Method 1: Domain Decomposition
In this type of partitioning, the data associated with the problem is decomposed. Each parallel task then works on a portion of the data.

[diagram: the problem data set split among Task 1, Task 2, Task 3, and Task 4]
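Domain decomposition can be sketched as a chunking helper: the data set is split into near-equal contiguous portions, one per task (the helper name and the summing workload are illustrative):

```python
# Domain decomposition sketch: split the problem's data into one
# contiguous, near-equal chunk per task.

def decompose(data, num_tasks):
    """Split data into num_tasks contiguous chunks of near-equal size."""
    base, extra = divmod(len(data), num_tasks)
    chunks, start = [], 0
    for t in range(num_tasks):
        size = base + (1 if t < extra else 0)   # spread the remainder
        chunks.append(data[start:start + size])
        start += size
    return chunks

data = list(range(10))
chunks = decompose(data, 4)
print(chunks)                 # → [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]

# Each task works only on its own portion (here, a stand-in partial sum):
partial = [sum(chunk) for chunk in chunks]
print(partial, sum(partial))  # → [3, 12, 13, 17] 45
```

The per-task results are combined at the end, which is typically a much cheaper step than the per-chunk work itself.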


Slide #15.

Method 2: Functional Decomposition
In this approach, the partitioning is based on the computation that is to be performed rather than on the data manipulated by the computation. The problem is decomposed according to the work that must be done. Each task then performs a portion of the overall work.

[diagram: the problem instruction set split among Task 1, Task 2, Task 3, and Task 4]
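By contrast with the previous sketch, a functional decomposition gives every task the same data but a different piece of the computation. The four statistics below are hypothetical stand-ins for the four tasks:

```python
# Functional decomposition sketch: each task performs a different part of
# the overall computation on the same data set.

data = [4.0, 8.0, 6.0, 2.0]

tasks = {                                  # one function per task
    "total": lambda d: sum(d),
    "mean":  lambda d: sum(d) / len(d),
    "max":   lambda d: max(d),
    "min":   lambda d: min(d),
}

# Each task could run on its own processor; here they run in turn.
results = {name: f(data) for name, f in tasks.items()}
print(results)   # → {'total': 20.0, 'mean': 5.0, 'max': 8.0, 'min': 2.0}
```

The split is by work rather than by data: no task touches a "portion" of the data set, but each contributes a distinct portion of the overall result.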


Slide #16.

Examples

Functional decomposition lends itself to problems that can be split into different tasks, such as:

Ecosystem Modeling
Each program calculates the population of a given group, where each group's growth depends on that of its neighbors. As time progresses, each process calculates its current state, then exchanges information with the neighbor populations. All tasks then progress to calculate the state at the next time step.

Signal Processing
An audio signal data set is passed through four distinct computational filters. Each filter is a separate process. The first segment of data must pass through the first filter before progressing to the second. When it does, the second segment of data passes through the first filter. By the time the fourth segment of data is in the first filter, all four tasks are busy.

Climate Modeling
Each model component can be thought of as a separate task. Arrows represent exchanges of data between components during computation: the atmosphere model generates wind velocity data that are used by the ocean model, the ocean model generates sea surface temperature data that are used by the atmosphere model, and so on.
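The signal-processing pipeline above can be sketched with one thread per filter stage, connected by queues. The "filters" below are trivial stand-ins (each just adds 1) rather than real DSP stages:

```python
# Pipeline sketch: four filter stages as threads linked by queues. Once
# segments fill the pipeline, all four stages work concurrently on
# different segments.

import queue
import threading

NUM_STAGES = 4
SENTINEL = None                                  # end-of-stream marker
links = [queue.Queue() for _ in range(NUM_STAGES + 1)]

def filter_stage(stage):
    """Read segments from upstream, transform, and pass downstream."""
    while True:
        segment = links[stage].get()
        if segment is SENTINEL:
            links[stage + 1].put(SENTINEL)       # propagate shutdown
            return
        links[stage + 1].put([x + 1 for x in segment])   # stand-in filter

threads = [threading.Thread(target=filter_stage, args=(s,))
           for s in range(NUM_STAGES)]
for t in threads: t.start()

for segment in ([0, 0], [10, 10], [20, 20]):     # stand-in audio segments
    links[0].put(segment)
links[0].put(SENTINEL)

out = []
while (seg := links[NUM_STAGES].get()) is not SENTINEL:
    out.append(seg)
for t in threads: t.join()
print(out)   # → [[4, 4], [14, 14], [24, 24]]
```

Each segment emerges having passed through all four stages in order, while later segments follow it through earlier stages, which is exactly the overlap the slide describes.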