As you did earlier with the roundRobin example, you will need to add the -tv option to mpirun before specifying your program's name. So in this case, you should type:
Exercise: Do this!
[agopu@bc81 agopu]$ cd ~/MPI_Tutorial/ParallelDiffusion/
[agopu@bc81 ParallelDiffusion]$ lamboot $PBS_NODEFILE
[agopu@bc81 ParallelDiffusion]$ mpirun C -tv parallelDiffusion 8000 1000 100
[agopu@bc81 ParallelDiffusion]$ lamhalt
Once again, hit Go in the process window with Group (Control) chosen, then stop the parallel program at its default breakpoint by answering Yes to the TV dialog box described in the above link. The root and process windows should look like this:


First let's run all four processors through half of the variable initializations and check for possible problems.
Select main inside the Stack Trace pane. Then, assuming you have not modified the source code, set a breakpoint at the position indicated in red below (line 124):
/* Set uniform initial conditions on my portion of the grid. */
for ( i = 1; i <= mynni; i++ )
This breakpoint will stop execution after computing dt in line 121.
Now let us check some variables.
In the Stack Frame pane, dive on the local variable myStarti, i.e. double-click on the variable myStarti within the Source Code pane or within the Stack Frame pane. When the Variable Window appears, pull down View -> Laminate and select Process. Laminate will show the values of myStarti for all four processes:
Click on the text box within the above Variable Window where it shows myStarti, replace it with myEndi, and hit return. The new Variable Window should look like:
Comparing the two windows, we have processor 0 working on i = 1-2000, processor 1 working on i = 2001-4000, and so forth. These values correspond with what we had planned in Domain decomposition and appear correct.
As an additional check, let us look at the values of mynni too. To do so, replace myEndi with mynni in the text box within the Variable Window and hit return. The third Variable Window should look like:
Each processor has been assigned 2000 x-points, exactly as we had wanted. So it appears that data decomposition is correctly functioning, and our bug must lie elsewhere. You may now close the Variable Window(s).
You could also try diving on local variables leftProc and rightProc and laminate to verify if these look right.
OK, time to move on...
Remove the above breakpoint by clicking on the STOP symbol.
Then place a breakpoint on the line shown in red (line 135 if the code is unchanged):
/* First the y = 0 boundary. */
for ( i = 0; i <= mynni+1; i++ )
This breakpoint will allow us to examine the temperature array u[][] after the initial conditions have been set. Make sure you're toggled to Group (Control), then hit Go. Check that all four processes have reached this second breakpoint (you could use the P+ and P- buttons to check that all processes are at the same line).
Among the local variables in the Stack Frame pane, dive on the temperature array u.
Important: It is usually not convenient to laminate a two-dimensional array; it is easier to examine u for process 0 alone (otherwise there will be way too much data). You might be disappointed to see only:
And if you dive into this (by double-clicking twice on the variable pointers shown) you will eventually view only the value of u[0][0], which is not very helpful.
But notice that at the top of the window the Type is double**, i.e. variable u is of type double** or, in simple terms, a memory pointer to a memory pointer to a long contiguous chunk of memory. This is illustrated below for the benefit of those unfamiliar with C pointers:
            -------------    -------------    -------------               -------------
u[0][0] ->  |Mem.loc.x_1| -> |Mem.loc.x_2| -> |Mem.loc.x_3| -> ... ... -> |Mem.loc.x_n|
            -------------    -------------    -------------               -------------
where x_1, x_2, ... x_n are contiguous memory locations, each holding one double,
and n = 2002 x 1002 for each process in our example.
Note: When displaying ordinary C or C++ variables, TotalView data types are identical to the C/C++ type representations. For pointers to arrays, however, you need to cast the C/C++ pointer-to-array to a TotalView pointer-to-array if you wish to view its contents. No such conversion is needed for any Fortran datatype. See the TotalView manual for more information on this.
Refer to the parallelDiffusion.c source code to review how we allocated the 2D array u. We allocated one big contiguous chunk of memory for the 2D array. In our test run, we used 8000 as the first parameter and 4 processors (2 nodes), which meant each process worked on 2000 x-points. With the extra ghost or boundary points at each end, that gives 2002 entries in i and, for the 1000 y-points, 1002 entries in j. So we can cast u to the TotalView type double[2002][1002]**, which is read in reverse as a pointer to a pointer to a 2002 x 1002 array. With this cast TotalView will understand how to display the C array. But how is the casting done?
In the Variable Window, left-click on the text box showing double**. A cursor should appear. Using your keyboard, modify double** to read double[2002][1002]**, then hit return.
Still working in the Variable Window, dive twice on the array, i.e. double-click (or right-click and select Dive) on the value shown in the Value pane within the Variable Window. You should now see processor 0's u in its initialized state, as shown below:
As you would expect, there are 2,002 x 1,002 = 2,006,004 elements in this array, way too many to make any sense out of! So pick and choose what data to view. Recall that our erroneous result was produced along j = 10. So restrict yourself to looking only at data for j = 10.
In the Variable Window left-click and release on the Slice field. Use the keyboard to enter slice parameters:
[:][10:10]
and hit return. This will restrict the data in the Variable Window to the j = 10 slice. The resulting variable window should look similar to:
Note: More information on slicing can be found in the TotalView manual.
But isn't 2002 entries still too many to scroll through? Would it not be nice to plot the data?
. . .
. . . TotalView lets you do that!
In the Variable Window pull down the Tools menu and select Visualize. After a moment a 2D plot should appear. In the Plot window pull down on File and select Options. In the resulting dialog box select Points and deselect Lines, then click OK. Your plot should look like:
And now you should be able to see the problem! Do you?
The initialized temperature is zero at u[2001][10], when in fact it should equal 1.
Along the j = 10 slice the only boundary points that should be set to zero are u[0][10] and u[8001][10]. The point at [2001][10] is one of processor 0's ghost points which should not have been set to zero.
Some asides:
If this is not obvious to you, close the visualization windows, edit the slicing text box to [1950:2002][10:10], and then redo the visualization following the steps outlined above.
Also, as a pure do-it-yourself exercise, if you are interested in playing around with the Visualization tool, try resetting the slice to [:][:] and redo the Visualize step given above. Then try View -> Surface on the smaller Visualization window (it shows "visualize" on the title bar of the window). You may find the surface visualization tool useful at some point.
We could repeat the investigation by looking at the other processors and, rest assured, we would find the same sort of error. Apparently, when we coded the setting of the initial conditions, we coded it exactly as we did in the serial version. Doing so proved to be wrong, because we thereby neglected to set the ghost points, which appear only in the parallel version.
The snippet of code we need to investigate lies between our two breakpoints:
/* Set uniform initial conditions on my portion of the grid. */
for ( i = 1; i <= mynni; i++ ) {
    for ( j = 1; j <= nnj; j++ ) {
        u[i][j] = u_0;
    }
}
To initialize the ghost points, the for-loop over i must expand to include them:
for ( i = 0; i <= mynni+1; i++ )
          ^       ^^^^^^^
However, for processes 0 and 3 this loop will set the i = 0 and i = nni + 1 boundaries to 1, when in fact we want those boundaries to be fixed at 0. Does this present a problem?
. . .
. . .
No! How is it not a problem? Think about it...
You probably guessed it right: the subsequent code, just beyond the second breakpoint, resets those boundaries to 0 while leaving the ghost points at 1.
The corrected code snippet with comments is shown below:
/* Set uniform initial conditions on my portion of the grid. */
/* Also set my ghost points. These are grid points not owned by */
/* me but that I need to compute my stencils. If I am processor 0 */
/* or processor numProcs-1, then half of these ghost points will */
/* be reset later below during the setting of the boundary conditions. */
for ( i = 0; i <= mynni+1; i++ ) {
    for ( j = 1; j <= nnj; j++ ) {
        u[i][j] = u_0;
    }
}
Once you fix the code, recompile it on a head node (bhX/ihX) by doing:
[agopu@bh2 agopu]$ cd ~/MPI_Tutorial/ParallelDiffusion/
[agopu@bh2 ParallelDiffusion]$ make clean
[agopu@bh2 ParallelDiffusion]$ make
and re-run the program on a compute node (bcXX/icXX) obtained through qsub -I. You should see correct results this time around!
[agopu@bc81 agopu]$ cd ~/MPI_Tutorial/ParallelDiffusion/
[agopu@bc81 ParallelDiffusion]$ lamboot $PBS_NODEFILE
[agopu@bc81 ParallelDiffusion]$ mpirun C parallelDiffusion 8000 1000 100
[agopu@bc81 ParallelDiffusion]$ lamhalt
After 100 time steps some results are:
actual u[400][10] = 0.874737
computed u[400][10] = 0.873978
After 100 time steps some results are:
actual u[1000][10] = 0.874806
computed u[1000][10] = 0.873978
Hopefully, this example has given you a good idea of how you could use TotalView to debug your parallel code!