1. Consider the problem of multiplying two n-by-n matrices:

    C = A*B

A and B are upper triangular matrices, i.e., all elements below the main diagonal are equal to zero. The result C is also upper triangular. Rewrite the serial matrix multiplication so as to avoid multiplications by zero:

    For j = 1 to n
      For i = 1 to j        // C(i,j) = 0 for i>j, we don't need to compute it
        For k = i to j      // the limits depend on i, j,
                            // such that multiplications by 0 are avoided
          C(i,j) = C(i,j) + A(i,k)*B(k,j)
        End k
      End i
    End j

Write an OpenMP parallel version that runs with an arbitrary number of threads. Use the parallel for pragma to parallelize the outer loop (j). Measure the speedups for fixed matrix sizes (n) and different numbers of threads (p). Discuss the load balancing problem (a different workload is associated with different j's). Propose a parallelization strategy where the workload of each thread is about the same. Implement your strategy and measure the performance benefits.

2. Derive an algorithm for multiplying two triangular matrices that is suitable for distributed memory platforms. Implement it using MPI and measure the speedups for fixed matrix sizes (n) and different numbers of processors (p). Discuss the load balancing problem. Offer a solution. (You do not need to implement it, but you need to make a solid argument that the solution works.)

3. Implement a parallel version of the WA-TOR problem. Comment on the load balancing and the synchronization issues. Make sure that the parallel version gives the same results as the serial version.

The WA-TOR problem is formulated as follows: Sharks and fish live in the ocean. The ocean is represented as an N-by-N grid of "water parcels". The opposing sides of the grid are connected: if a fish or a shark moves out on one side of the simulation domain, it reenters immediately on the opposing side. Each grid cell can be empty or have a fish or a shark (but not both).
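
As an illustration of the connected opposing sides described above, the following is a minimal sketch in C of periodic (wrap-around) indexing. The helper names wrap and neighbor_index and the row-major storage of the grid are assumptions made for this sketch; they are not prescribed by the problem. The problem statement also does not say which cells count as neighbors, so the offset (drow, dcol) is left to the caller.

```c
#include <assert.h>

/* Hypothetical helper: map any integer coordinate onto 0..n-1, so that
   stepping off one edge of the grid re-enters on the opposite edge. */
static int wrap(int coord, int n)
{
    assert(n > 0);
    /* In C, % can yield a negative result for a negative left operand,
       so n is added before the final reduction. */
    return ((coord % n) + n) % n;
}

/* Hypothetical helper: row-major index of the cell reached from
   (row, col) by the offset (drow, dcol) on an n-by-n periodic grid. */
static int neighbor_index(int row, int col, int drow, int dcol, int n)
{
    return wrap(row + drow, n) * n + wrap(col + dcol, n);
}
```

With n = 4, for example, moving up from cell (0,0) gives neighbor_index(0, 0, -1, 0, 4), which lands in row 3 of the same column.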

Initially the grid is populated with fish and sharks distributed randomly (some grid cells are empty). The initial age of each fish and each shark is 0 weeks. Let

    N  = grid size,
    FB = breeding age of fish (weeks),
    SB = breeding age of sharks (weeks),
    SS = starvation time of sharks (weeks),
    FO = old age when fish die (weeks),
    SO = old age when sharks die (weeks).

You can choose your own values for these numbers. Fish and sharks move every week (if possible) and interact according to the following set of rules.

Rules for Fish:
- Each week a fish tries to move to a neighboring empty cell (picked randomly). If the randomly picked neighbor is not empty, the fish remains in the current grid cell.
- If a fish has reached FB weeks (breeding age), it breeds when it moves, leaving behind a fish of age 0. A fish does not breed if it does not move.
- A fish dies of old age when it reaches FO weeks.

Rules for Sharks:
- Each week, if one of the neighboring cells has a fish, the shark moves to that cell, eating the fish. If not, and if one of the neighboring cells is empty, the shark moves there. Otherwise, it stays.
- If a shark has reached SB weeks (breeding age), it breeds when it moves, leaving behind a shark of age 0. A shark cannot breed if it does not move.
- Sharks eat only fish. If SS weeks have passed since a shark last ate, the shark dies of starvation.
- A shark dies of old age when it reaches SO weeks.

4. (GRADUATE STUDENTS ONLY) Read Section 10.7.2 from the textbook. Solve problem 10.17 (page 467) in the text. Compute the isoefficiency function as well.

4. (UNDERGRADUATE STUDENTS ONLY) Consider the problem of automatically parallelizing a code. Specifically, we want to design a compiler/translator that will be able to parse any given input code, identify parallel regions, and automatically insert the appropriate OpenMP pragmas for the parallel execution of the code. The resulting code must be:
   1. correct (always give the same answers as the serial code);
   2. general (work with any number of threads);
   3. effective (in particular, it has to unravel and exploit all the parallelism in the original code);
   4. efficient (it has to ensure reasonable load balancing, etc.).

Assume that your first job after you graduate is to solve this problem. We don't expect that you know how to solve this problem. What strategies could you use to find out more about the problem and how to make progress on it? What strategies would you use to evaluate your progress towards a solution? The answer should be about 1/2 page long and as complete as possible.

(Note: undergraduate question 4 counts for 5% of the final exam grade. It is posed in relation to the ABET accreditation of our department.)

5. (EXTRA CREDIT) Implement an OpenMP parallel version of the solution of a linear system. Recall that there are two stages, "LU factorization" and "forward and backward substitution". You can choose to start with the serial algorithm discussed in class, or with any other valid algorithm. Test and measure performance for different n, p values. Comment.

6. (EXTRA CREDIT) Implement an MPI parallel version of the solution of a linear system. Recall that there are two stages, "LU factorization" and "forward and backward substitution". You can choose to start with the serial algorithm discussed in class, or with any other valid algorithm. Test and measure performance for different n, p values. Comment.
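
For reference, the two stages named in problems 5 and 6 can be sketched serially in C as below. This is only one possible baseline and is not necessarily the algorithm discussed in class. It assumes a dense n-by-n matrix stored in row-major order and performs no pivoting, so it assumes every pivot is nonzero. The function names lu_factor and lu_solve are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stage 1: in-place LU factorization without pivoting (assumption:
   all pivots are nonzero). On return, the strict lower triangle of a
   holds L (unit diagonal implied) and the upper triangle holds U.
   Returns 0 on success, -1 if a zero pivot is encountered. */
static int lu_factor(double *a, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        if (a[k * n + k] == 0.0)
            return -1;
        for (size_t i = k + 1; i < n; i++) {
            a[i * n + k] /= a[k * n + k];   /* multiplier l(i,k) */
            for (size_t j = k + 1; j < n; j++)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
    }
    return 0;
}

/* Stage 2: solve L*U*x = b using the factors produced by lu_factor.
   The right-hand side b is overwritten by the solution x. */
static void lu_solve(const double *a, double *b, size_t n)
{
    assert(n > 0);
    /* Forward substitution: L*y = b, L has a unit diagonal. */
    for (size_t i = 1; i < n; i++)
        for (size_t j = 0; j < i; j++)
            b[i] -= a[i * n + j] * b[j];
    /* Backward substitution: U*x = y. */
    for (size_t i = n; i-- > 0; ) {
        for (size_t j = i + 1; j < n; j++)
            b[i] -= a[i * n + j] * b[j];
        b[i] /= a[i * n + i];
    }
}
```

In this sketch, the updates of rows i = k+1, ..., n-1 inside lu_factor are independent of each other for a fixed k, while the k loop itself and both substitution loops carry dependencies from one iteration to the next.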