Homework Four
The goal of this assignment is to write a parallel algorithm using MPI and to
investigate the scalability of your algorithm.
Part One
For Part One of this assignment you are to code the naive version of the matrix
multiply program as described in lecture in either C or C++ using
the MPI message passing library. Your program must compile without warnings
and execute correctly on
eagle.csce.uark.edu for full credit.
For full credit use good programming
style, including the use of an appropriate amount of comments. In
addition to the source code, your submission must also include the answers,
in a plain text file,
to the questions found at the end of the page.
Specifically:
- Assume that the number of rows of the matrix is evenly divisible
by the number of processes. Input N, the matrix order, from the
command line. (A square matrix of order N has N*N elements in it.)
- In each process, initialize its portion of A and B appropriately
in a function, using A(i,j) = B(i,j) = 1 / (i + j + 1). For those of you
who are interested, matrices of this form are called Hilbert matrices.
After the initialization, the algorithm must proceed without using
specific knowledge that the blocks are those of a Hilbert matrix.
- To avoid printing huge matrices, write Print_matrix so that just
the first entry of the first result row held by each process is printed.
Thus if p=4 processes are used and N=1024, then 4 numbers are printed:
C[0,0], C[256,0], C[512,0], C[768,0].
- Insert calls to MPI_Wtime to time just the matrix multiply
portion of your code. The code that you submit should have these
calls in it.
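The row-block bookkeeping described in the list above can be sketched in plain C as follows. This is only an illustration under stated assumptions, not the version presented in lecture: the helper names first_row_of, init_hilbert_rows, and multiply_rows are hypothetical, matrices are assumed to be stored row-major in flat arrays, and the multiply kernel assumes the full matrix B has already been assembled on the process through communication. In a real program, rank and p would come from MPI_Comm_rank and MPI_Comm_size.

```c
#include <assert.h>

/* First global row owned by `rank` when n rows are split evenly
 * over p processes (hypothetical helper). */
static int first_row_of(int rank, int n, int p)
{
    assert(p > 0 && n % p == 0);  /* the assignment guarantees divisibility */
    return rank * (n / p);
}

/* Fill a rows-by-n slice (row-major) with Hilbert entries 1/(i + j + 1),
 * where i is the GLOBAL row index and j the column index. */
static void init_hilbert_rows(double *slice, int n, int first_row, int rows)
{
    for (int r = 0; r < rows; r++)
        for (int j = 0; j < n; j++)
            slice[r * n + j] = 1.0 / (double)(first_row + r + j + 1);
}

/* Local kernel: c (rows x n) = a (rows x n) * b (n x n), all row-major.
 * Assumes b was obtained by communication, not regenerated locally. */
static void multiply_rows(const double *a, const double *b, double *c,
                          int rows, int n)
{
    for (int r = 0; r < rows; r++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[r * n + k] * b[k * n + j];
            c[r * n + j] = sum;
        }
    }
}
```

With these helpers, the one value a process prints is the first element of its local result slice, which corresponds to global row first_row_of(rank, N, p) and column 0.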
Also, answer the following summary questions in a plain text file:
- Were there any features of the assignment that you did not successfully implement?
- How did the input you used test your program thoroughly for its
correct operation?
- How did you analyze the output of your program to convince
yourself that your program is working correctly?
- How many hours did you spend designing, coding, and debugging this
program?
Turn in for Part One:
Email the source code
and answers to the questions to Hai Nguyen, hqn01@uark.edu.
Part Two
The purpose of Part Two of this assignment is to investigate the
scalability and performance of your parallel code. For Part Two you
will prepare a report using Word or another word processing tool. You
should submit your report on paper at the start of the class period
during which it is due. Your report should be nicely formatted and
free of grammatical errors. Your graphs should have an explanation.
All portions of the graphs should be labeled appropriately, including
the axes. You should include a key or label each curve on the graph.
Use standard axes, not logarithmic axes.
Specifically:
- Take performance runs of your naive matrix multiplication MPI
program using p = 1 and p = 4 and a range of values of N. Start with
N=16 and use powers of 2, continuing until the run times reach a
minute or two.
- Create a graph that displays your timings. The horizontal axis is
N and the vertical axis is the elapsed time. There will be two curves
on this graph, one for p = 1 and one for p = 4.
- Create a second graph that displays the parallel efficiency of
your timings. The horizontal axis is N and the vertical axis is the
parallel efficiency: T_1(N)/(4*T_4(N)).
- Modify fox.c as follows:
  - Take the matrix order N from the command line instead of
    stdin.
  - Instead of having a root process read in A and B and send the
    blocks to the processes, have each process create the entries in its
    block of H by appropriate use of the formula 1/(i + j +
    1). Note: Each process initializes its block of H
    once. Thereafter the algorithm must proceed without using specific
    knowledge that the blocks are those of a Hilbert matrix. So it is
    illegal to just create another block rather than getting it by
    communication from the appropriate process!
  - In main(), have process zero time the call to function fox()
    using calls to MPI_Wtime() both before and after the execution of
    the timed code.
  - To limit memory problems, get rid of the global variable
    temp_mat. Also be sure no process has more than 4 matrix blocks
    allocated at any one time.
  - To avoid printing huge matrices, modify Print_matrix so that
    just the first entry in each block is printed. Thus if p=4
    processes are used and N=1024, then 4 numbers are printed: C[0,0],
    C[0,512], C[512,0], C[512,512].
- As above, take performance runs using p = 1 and p = 4 and a range of values
of N that are powers of 2 and lead to run times up to a minute or
two.
- Create one graph that displays your timings of Fox's
algorithm. The horizontal axis is N and the vertical axis is the
elapsed time. There will be two curves on this graph, for p = 1 and
for p = 4.
- Create a second graph that displays the parallel efficiency of
your timings. The horizontal axis is N and the vertical axis is the
parallel efficiency: T_1(N)/(4*T_4(N)).
- Create a third graph that shows two curves, the timings for the
naïve algorithm for p=4 and the timings for Fox's algorithm for p=4.
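For the fox.c modifications listed above, the block bookkeeping can be sketched as follows. This is a minimal illustration under assumptions, not the contents of fox.c: the names BlockCoord, grid_order, block_coord, init_hilbert_block, and bcast_root are hypothetical, ranks are assumed to map to grid coordinates in row-major order, and blocks are assumed to be stored row-major in flat arrays. In the real program the coordinates would presumably come from the Cartesian communicator rather than from arithmetic on the rank.

```c
#include <assert.h>

/* Position of a process in the q x q process grid (hypothetical type). */
typedef struct {
    int row;
    int col;
} BlockCoord;

/* Return q such that q * q == p; Fox's algorithm needs a square grid. */
static int grid_order(int p)
{
    int q = 0;
    while (q * q < p)
        q++;
    assert(q * q == p);
    return q;
}

/* Row-major mapping from rank to grid coordinates (an assumption). */
static BlockCoord block_coord(int rank, int q)
{
    BlockCoord c = { rank / q, rank % q };
    return c;
}

/* Fill an n_bar x n_bar block with Hilbert entries 1/(i + j + 1),
 * where i and j are GLOBAL indices derived from the block position. */
static void init_hilbert_block(double *block, int n_bar, BlockCoord c)
{
    for (int i = 0; i < n_bar; i++) {
        for (int j = 0; j < n_bar; j++) {
            int gi = c.row * n_bar + i;  /* global row */
            int gj = c.col * n_bar + j;  /* global column */
            block[i * n_bar + j] = 1.0 / (double)(gi + gj + 1);
        }
    }
}

/* Grid column whose A block is broadcast along row my_row at a stage. */
static int bcast_root(int my_row, int stage, int q)
{
    return (my_row + stage) % q;
}
```

Each process calls init_hilbert_block exactly once; every other block it needs during the stages must arrive by broadcast or send/receive, as the assignment requires.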
Create a report that you will turn in on paper.
The report will have five graphs, each with an
appropriate explanation.
In a final summary paragraph, indicate whether the graphs represent what
you expected, or whether your results differ in some way from what
you expected. Also, indicate the number of hours that you spent doing
this part of the assignment.
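When comparing your measurements against what you expected, it may help to compute speedup and parallel efficiency directly from the recorded times. The sketch below is one hedged way to do that. The function names speedup and efficiency are hypothetical, and t1 and tp stand for the elapsed times measured with 1 process and with p processes for the same N.

```c
#include <assert.h>

/* Speedup of a p-process run over the 1-process run for the same N. */
static double speedup(double t1, double tp)
{
    assert(tp > 0.0);
    return t1 / tp;
}

/* Parallel efficiency T_1 / (p * T_p); 1.0 means perfect scaling. */
static double efficiency(double t1, double tp, int p)
{
    assert(p > 0 && tp > 0.0);
    return t1 / ((double)p * tp);
}
```

For example, with made-up timings of 8.0 seconds on one process and 4.0 seconds on four processes, efficiency(8.0, 4.0, 4) evaluates to 0.5, meaning half of the ideal fourfold speedup was achieved.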
Turn in for Part Two:
Email your modified version of fox.c to Hai Nguyen, hqn01@uark.edu.
Turn in your report on paper in class at the start of class on the day
that it is due.
Enjoy!