Code optimization for use in high performance/high throughput computing



Our lab group recently purchased a floating network license to use GAUSS on the high performance/high throughput computing system at the University of Arizona. After a few test runs, I found that performance does not improve with an increasing number of CPU cores unless I also increase the number of threads in my code. I tried running code with 10 threads in the main loop and 2 threads in a bootstrapping procedure invoked in each major loop. When running it on an MPI system, there is no improvement once the number of cores exceeds 10. When I ran the same code on an SMP system, the same pattern held: using more than 10 cores does not improve performance. Nor did I see any improvement from using 2 nodes, each with 10 cores.

Moreover, running the same code on the SMP system seems to require more memory per core than on the MPI system. On the MPI system, it is sufficient to assign slightly less than 2 GB of memory to each core. On the SMP system, however, I often received an insufficient-memory error when I assigned 2 GB per core, and I had to increase the per-core memory to 4 GB for a successful run. Unfortunately, only a very limited number of nodes in the SMP system have more than 2 GB of memory per core, so requesting more memory per core means waiting longer for those nodes to become available.

My first question is whether there are other ways to improve the performance of my code by increasing only the number of nodes or cores I request from either the MPI or SMP system, without increasing the number of threads. My second question is what causes the same code to require more memory on the SMP system than on the MPI system. If it always requires more memory on the SMP system, is there any advantage to choosing SMP over MPI?

Thank you!

asked May 6, 2014

3 Answers


How to take advantage of large numbers of cores
Ultimately, to take advantage of a larger number of cores, you will need a larger number of threads. There are two ways this can be accomplished. The most obvious is to create threads explicitly with threadBegin, threadStat, etc.
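For example, two independent statements can be launched on separate threads and then joined before their results are used. This is only a sketch; `x1` and `x2` stand in for whatever data your program actually works on:

x1 = rndn(1e6, 1);
x2 = rndn(1e6, 1);

//Run two independent statements, each on its own thread
threadStat m1 = meanc(x1);
threadStat m2 = meanc(x2);

//Wait for both threads to finish before using m1 and m2
threadJoin;

d = m1 - m2;

The key constraint is that the threaded statements must be independent of each other; threadJoin blocks until all outstanding threads complete, after which their results are safe to use.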

The second way to make your GAUSS program use a greater number of threads is (when possible) to change element-by-element operations and loops into vectorized statements. A simple, but illustrative example is the dot product of two vectors. You can perform this calculation in GAUSS like this:

x = rndn(1e6, 1);
y = rndn(1e6, 1);
z = 0;

//Calculate dot product
for i(1, rows(x), 1);
   z = z + x[i] .* y[i];
endfor;

or you could simplify it to this:

x = rndn(1e6, 1);
y = rndn(1e6, 1);
z = x'*y;

You could certainly add threading statements to the first code snippet above. If the vectors were long enough, you could get a good speedup from a small number of threads.
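As a hedged illustration (not code from the original post), one way to thread the first snippet is to let each of two threads handle half of the vectors and then combine the partial results:

//Split the work roughly in half
h = floor(rows(x)/2);

//Each thread computes a partial dot product
threadStat z1 = x[1:h]' * y[1:h];
threadStat z2 = x[h+1:rows(x)]' * y[h+1:rows(x)];
threadJoin;

z = z1 + z2;

Note that each partial product here is itself vectorized; the point is simply that independent pieces of a loop's work can be assigned to explicit threads.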

The second code snippet, however, will be automatically threaded by GAUSS “under the hood”. These automatic threads have much less overhead than the GAUSS-level threading commands. They also adjust automatically to the number of cores on your machine, the size of your data and the calculation being performed.

Threading limitations
GAUSS does not place any limit on the number of threads that you can create. As you increase the number of threads for a given problem, however, the overhead of thread creation will become an increasing portion of the overall run time. Therefore, any particular calculation can only scale up to a certain point.

What to do?
Most likely there are some solid performance improvements to be had by vectorizing your code.
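For instance, an inner bootstrap loop can often be tightened by drawing each resample with a single vectorized indexing operation rather than building it element by element. The sketch below is illustrative only; `x`, `nboot`, and the choice of the mean as the bootstrap statistic are assumptions, not details from the original code:

n = rows(x);
bmeans = zeros(nboot, 1);

for j(1, nboot, 1);
   //Draw a full resample at once with vectorized indexing
   idx = ceil(n .* rndu(n, 1));
   bmeans[j] = meanc(x[idx]);
endfor;

Each iteration still runs in the loop, but the work inside it is now a handful of vectorized statements that GAUSS can thread internally.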

Why would SMP use more memory than MPI?
Given the number of unknowns, and without having seen the code, it is hard to say. For the most part, GAUSS should not require more memory on a shared-memory machine. However, GAUSS may have seen more cores available on the shared-memory machine and created more internal threads, which would use more memory.


Many thanks for this very helpful response. Unfortunately, there is very little I can change to further vectorize the code. So I have to accept that the only way to further improve performance is to increase the number of threads, which comes at the cost of increased overhead.

I do have another general question. Given the architecture of the GAUSS program, does GAUSS code generally perform better on an SMP system than on an MPI system? So far, I have seen mixed results in my test runs. If GAUSS does indeed perform better on an SMP system, then there is a strong reason for me to choose SMP over MPI despite the increased memory requirement.


The GAUSS threading statements use either pthreads or OpenMP, so they are best suited to shared-memory systems.