Exercise: Do this!
Switch to the ParallelPi_VT directory on a head node (bhX/ihX) and then compile the program.
[agopu@bh2 agopu]$ cd ~/MPI_Tutorial/ParallelPi_VT/
[agopu@bh2 ParallelPi_VT]$ make
Then run the ITC-instrumented code on the interactive nodes (bcXX/icXX) that you obtained through qsub -I:
[agopu@bc81 agopu]$ cd ~/MPI_Tutorial/ParallelPi_VT/
[agopu@bc81 ParallelPi_VT]$ lamboot $PBS_NODEFILE
[agopu@bc81 ParallelPi_VT]$ mpirun C parallelPiVt 10000
[agopu@bc81 ParallelPi_VT]$ lamhalt
Finally, open the resulting trace file:
[agopu@bc81 ParallelPi_VT]$ traceanalyzer parallelPiVt.stf
The following figure shows the ITA timeline of the parallelPiVt program with the additional ITC instrumentation:
Do you see the states we named for-loop and print, which should increase the resolution of our analysis? Probably not, because the entire computation took very little time, on the order of milliseconds. The real guts of the computation are in the for-loop, where the integral is calculated. You should therefore try increasing the number of steps from 10000 to, say, 1,000,000 and see how that changes the timeline, i.e. run the program as follows:
Exercise: Do this!
[agopu@bc81 ParallelPi_VT]$ lamboot $PBS_NODEFILE
[agopu@bc81 ParallelPi_VT]$ mpirun C parallelPiVt 1000000
[agopu@bc81 ParallelPi_VT]$ lamhalt
[agopu@bc81 ParallelPi_VT]$ traceanalyzer parallelPiVt.stf
ITA's activity chart for the trace file parallelPiVt.stf is shown below:
This is a detailed histogram view showing all states. The bars appear from left to right in the same order as the function legend from top to bottom. Here we have the ITC predefined states for the MPI function calls, the User_Code state, and the TRACE_ON state. In addition, the histogram breaks out our user-defined states for-loop, main, and print. One must take care in interpreting the quantitative relationship among these states in this display. As you can see, the bar heights must be added together to obtain the total run time on each processor. Although the main program encloses all of these states, the bar associated with state main clearly represents only the net time that remains after subtracting the times of all the other states from the total.
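The accounting described above can be illustrated with a small arithmetic sketch. Every number below is made up for illustration and is not read from parallelPiVt.stf, and the helper name exclusive_time is hypothetical. It only shows that the bar for main is what remains after the other states are subtracted from the total, so that stacking all bars recovers the total run time.

```python
# Hypothetical per-process timings in seconds. These values are illustrative
# assumptions, not measurements from the tutorial's trace file.
total_runtime = 0.050

# Time attributed to each state other than main. The state names follow the
# legend described in the text; the values are invented for this example.
other_states = {
    "MPI": 0.030,
    "User_Code": 0.001,
    "TRACE_ON": 0.002,
    "for-loop": 0.010,
    "print": 0.003,
}


def exclusive_time(total, others):
    """Return the net time left for the enclosing state after subtracting
    the time spent in all the other states."""
    return total - sum(others.values())


main_time = exclusive_time(total_runtime, other_states)

# Adding every bar together, main included, gives back the total run time.
stacked = main_time + sum(other_states.values())

print(f"main (net): {main_time:.3f} s")
print(f"stacked total: {stacked:.3f} s")
```

With these assumed values the other states account for most of the run, which is why the bar for main can look small even though main encloses everything else.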
It is again clear that, for a low number of integration steps, too little of each processor's time is spent actually computing the integral (for-loop).
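To see why the for-loop state is so short for a small step count, here is a serial sketch of how such a loop might divide its work. The tutorial's source code is not shown here, so the integrand 4/(1+x^2) on [0, 1], the midpoint rule, and the round-robin split across ranks are all assumptions borrowed from the usual MPI pi example. The function names and the choice of 4 processes are hypothetical.

```python
import math


def local_partial_sum(rank, nprocs, steps):
    """Midpoint-rule contribution of one rank to the integral of
    4 / (1 + x^2) over [0, 1], with steps dealt out round-robin.
    (Assumed scheme; the tutorial's actual loop may differ.)"""
    h = 1.0 / steps
    total = 0.0
    for i in range(rank, steps, nprocs):
        x = (i + 0.5) * h
        total += 4.0 / (1.0 + x * x)
    return total * h


def estimate_pi(nprocs, steps):
    """Serial stand-in for the parallel reduction: add up every rank's share."""
    return sum(local_partial_sum(r, nprocs, steps) for r in range(nprocs))


def iterations_per_rank(nprocs, steps):
    """Number of loop iterations each rank executes under the round-robin split."""
    return [len(range(r, steps, nprocs)) for r in range(nprocs)]


if __name__ == "__main__":
    for steps in (10_000, 1_000_000):
        error = abs(estimate_pi(4, steps) - math.pi)
        print(steps, iterations_per_rank(4, steps), error)
```

Under these assumptions, 10000 steps on 4 processes leaves each rank only 2500 loop iterations, a tiny amount of arithmetic compared with the fixed cost of startup and communication. Raising the step count to 1,000,000 multiplies the loop work by 100 while that fixed cost stays roughly the same.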