Homework Four

The goal of this assignment is to write a parallel algorithm using MPI and to investigate the scalability of your algorithm.


Part One

For Part One of this assignment you are to code the naive version of the matrix multiply program as described in lecture in either C or C++ using the MPI message passing library. Your program must compile without warnings and execute correctly on eagle.csce.uark.edu for full credit. For full credit use good programming style, including the use of an appropriate amount of comments. In addition to the source code, your submission must also include the answers, in a plain text file, to the questions found at the end of the page. Specifically:

  1. Assume that the number of rows of the matrix is evenly divisible by the number of processes. Input N, the matrix order, from the command line. (A square matrix of order N has N*N elements in it.)
  2. In each process, initialize its portion of A and B in a function, using A(i,j) = B(i,j) = 1 / (i + j + 1), where i and j are the global zero-based row and column indices. For those of you who are interested, matrices of this form are called Hilbert matrices. After the initialization, the algorithm must proceed without using specific knowledge that the blocks are those of a Hilbert matrix.
  3. To avoid printing huge matrices, write Print_matrix so that just the first entry of the first result row held by each process is printed. Thus if p=4 processes are used and N=1024, each process holds 256 rows and 4 numbers are printed: C[0,0], C[256,0], C[512,0], C[768,0].
  4. Insert calls to MPI_Wtime to time just the matrix multiply portion of your code. The code that you submit should have these calls in it.
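
As a rough illustration of item 2, the sketch below fills one process's block of rows from the formula 1 / (i + j + 1). It assumes a row-wise distribution in which each process owns N/p consecutive rows stored row-major; the function names first_row and init_hilbert_rows and the variable names are hypothetical and are not taken from the lecture code. The rank and process count would normally come from MPI_Comm_rank and MPI_Comm_size, but they are plain parameters here so the fragment compiles without MPI.

```c
#include <stdlib.h>

/* Hypothetical helper: first global row owned by `rank` when n rows are
   split evenly over p processes (assumes n is divisible by p). */
int first_row(int rank, int n, int p)
{
    return rank * (n / p);
}

/* Hypothetical helper: allocate and fill this process's (n/p) x n slab of
   the Hilbert matrix, stored row-major. Global indices i and j are
   zero-based, so entry (i, j) is 1 / (i + j + 1).
   Returns NULL if the allocation fails; the caller frees the slab. */
double *init_hilbert_rows(int rank, int n, int p)
{
    int local_rows = n / p;
    int row0 = first_row(rank, n, p);
    double *slab = malloc((size_t)local_rows * (size_t)n * sizeof *slab);

    if (slab == NULL)
        return NULL;

    for (int li = 0; li < local_rows; li++) {
        int i = row0 + li;                /* global row index */
        for (int j = 0; j < n; j++)
            slab[(size_t)li * n + j] = 1.0 / (double)(i + j + 1);
    }
    return slab;
}
```

The same function can be called once for A and once for B. Only the initialization uses the global indices; the multiply itself should treat the slabs as ordinary data.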

Also, answer the following summary questions in a plain text file:

  1. Were there any features of the assignment that you did not successfully implement?
  2. How did the input you used test your program thoroughly for its correct operation?
  3. How did you analyze the output of your program to convince yourself that it is working correctly?
  4. How many hours did you spend designing, coding, and debugging this program?

Turn in for Part One:

Email the source code and answers to the questions to Hai Nguyen, hqn01@uark.edu.


Part Two

The purpose of Part Two of this assignment is to investigate the scalability and performance of your parallel code. For Part Two you will prepare a report using Word or another word processing tool. You should submit your report on paper at the start of the class period during which it is due. Your report should be nicely formatted and free of grammatical errors. Your graphs should have an explanation. All portions of the graphs should be labeled appropriately, including the axes. You should include a key or label each curve on the graph. Use standard axes, not logarithmic axes.

Specifically:

  1. Take performance runs of your naive matrix multiplication MPI program using p = 1 and p = 4 over a range of values of N that are powers of 2, starting with N=16 and continuing until the run times reach a minute or two.
  2. Create a graph that displays your timings. The horizontal axis is N and the vertical axis is the elapsed time. There will be two curves on this graph, one for p = 1 and one for p = 4.
  3. Create a second graph that displays the parallel efficiency of your timings. The horizontal axis is N and the vertical axis is the parallel efficiency: T_1(N) / (4 * T_4(N)).

  4. Modify fox.c as follows:
    1. Take the matrix order N from the command line instead of stdin.
    2. Instead of a root process reading in A and B and sending the blocks to the processes, each process creates the entries in its block of H from appropriate use of the formula 1/(i + j + 1). Note: Each process performs this initialization of its block of H once. Thereafter the algorithm must proceed without using specific knowledge that the blocks are those of a Hilbert matrix. So it is illegal to just create another block rather than getting it by communication from the appropriate process!
    3. In main() have process zero time the call to function fox() using calls to MPI_Wtime() both before and after the execution of the timed code.
    4. To limit memory problems, get rid of the global variable temp_mat. Also be sure no process has more than 4 matrix blocks allocated at any one time.
    5. To avoid printing huge matrices, modify Print_matrix so that just the first entry in each block is printed. Thus if p=4 processes are used and N=1024, each block is 512 by 512 and 4 numbers are printed: C[0,0], C[0,512], C[512,0], C[512,512].
  5. As above, take performance runs using p = 1 and p = 4 and a range of values of N which are powers of 2 and lead to run times up to a minute or two.
  6. Create one graph that displays your timings of Fox's algorithm. The horizontal axis is N and the vertical axis is the elapsed time. There will be two curves on this graph, for p = 1 and for p = 4.
  7. Create a second graph that displays the parallel efficiency of your timings. The horizontal axis is N and the vertical axis is the parallel efficiency: T_1(N) / (4 * T_4(N)).
  8. Create a third graph that shows two curves, the timings for the naïve algorithm for p=4 and the timings for Fox's algorithm for p=4.
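
The efficiency values for the graphs can be computed from the measured times with a few lines of code. The sketch below reads the formula in the usual way, as T_1 divided by the whole product p * T_p, so that perfect speedup gives an efficiency of 1. The function name parallel_efficiency is hypothetical, and the negative return value for unusable input is a convention chosen only for this example.

```c
/* Hypothetical helper: parallel efficiency E = T_1 / (p * T_p), where
   t1 is the elapsed time on one process and tp the elapsed time on p
   processes for the same N. Returns -1.0 if p or tp is not positive. */
double parallel_efficiency(double t1, double tp, int p)
{
    if (p <= 0 || tp <= 0.0)
        return -1.0;
    return t1 / ((double)p * tp);
}
```

For example, a run that takes 8 seconds on one process and 4 seconds on four processes has an efficiency of 8 / (4 * 4) = 0.5.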

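For the change to fox.c in which each process builds its own block of H, one possible shape for the initialization is sketched below. It assumes the p processes form a q by q grid with q * q = p, that N is divisible by q, and that the block at grid position (my_row, my_col) is stored row-major. The names grid_order, init_hilbert_block, n_bar, my_row, and my_col are illustrative and may not match the names used in fox.c; the grid coordinates are passed in as plain integers so the fragment compiles without MPI.

```c
#include <stdlib.h>

/* Hypothetical helper: return q such that q * q == p, or -1 if p is not
   a perfect square. */
int grid_order(int p)
{
    for (int q = 1; q * q <= p; q++)
        if (q * q == p)
            return q;
    return -1;
}

/* Hypothetical helper: allocate and fill the (n/q) x (n/q) block of the
   Hilbert matrix at grid position (my_row, my_col). Global indices are
   zero-based, so entry (i, j) is 1 / (i + j + 1).
   Returns NULL if the allocation fails; the caller frees the block. */
double *init_hilbert_block(int my_row, int my_col, int n, int q)
{
    int n_bar = n / q;            /* order of one block */
    int i0 = my_row * n_bar;      /* global row of the block's first entry */
    int j0 = my_col * n_bar;      /* global column of the block's first entry */
    double *block = malloc((size_t)n_bar * (size_t)n_bar * sizeof *block);

    if (block == NULL)
        return NULL;

    for (int li = 0; li < n_bar; li++)
        for (int lj = 0; lj < n_bar; lj++)
            block[(size_t)li * n_bar + lj] =
                1.0 / (double)(i0 + li + j0 + lj + 1);
    return block;
}
```

This is done once per process. Every other block needed during the algorithm still has to arrive by communication.
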
Create a report that you will turn in on paper. Your report will have five graphs, each with an appropriate explanation. In a final summary paragraph, indicate whether the graphs show what you expected or whether your results differ in some way from what you expected. Also, indicate the number of hours that you spent on this part of the assignment.

Turn in for Part Two:

Email your modified version of fox.c to Hai Nguyen, hqn01@uark.edu. Turn in your report on paper in class at the start of class on the day that it is due.

Enjoy!