I asked about parallel computation with GAUSS before. I experimented with ThreadFor on a 64-core AMD Opteron machine, and it appeared that all 64 cores were used for massive simulations. But now I have a 36-core Xeon machine (36 physical cores, 72 logical cores), and each run uses only around 30% of the CPU. I recall that in later versions of GAUSS (15 and 16), ThreadFor would automatically utilize all available cores, which would save the chore of using ThreadJoin. So what is really going on with ThreadFor? By the way, MATLAB's parfor did in fact use all the Xeon cores.


threadfor divides the number of iterations by the number of cores reported by the operating system. It then creates the threads and assigns the work to them.
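As a minimal sketch of the kind of loop being discussed, the following splits independent iterations across threads. The variable names (`n_reps`, `results`) and the per-iteration work are hypothetical placeholders, not taken from the original poster's program.

```
// Hypothetical workload: n_reps independent iterations
n_reps = 1000;
results = zeros(n_reps, 1);

threadfor i (1, n_reps, 1);
    // Each iteration writes only to its own row of 'results',
    // so the threads do not compete for the same element
    results[i] = sumc(seqa(1, 1, i));
threadendfor;

print meanc(results);
```

Assigning into `results[i]` by the loop counter keeps the iterations independent of each other, which is what allows the work to be divided among threads.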

You can check how many cores GAUSS thinks your computer has with the command:

```
n_cores = sysstate(42,0);
print n_cores[1];
```

There are a number of reasons why a particular threaded GAUSS program may see poor performance:

Oversubscription

Creating more threads than your computer has physical cores can cause very poor performance in many cases. You should check how many cores GAUSS is being told the computer has, and if it is 72 (all the 'real' cores plus the hyperthreads), you should set that to 36.

```
// Set maximum number of 'threadfor' cores to 36
original_n_cores = sysstate(42,36);
```

Also, some GAUSS functions are internally threaded, which can create even more threads at run time. To help minimize this problem, GAUSS 16 caps at 2 the number of threads an internal GAUSS function can create inside a threadfor loop or a threadbegin/threadend block.
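The capping step can be sketched a little more generally, as below. This assumes that hyperthreading exposes exactly two logical cores per physical core, and that the first element of the value returned by `sysstate(42, ...)` is the previous core count (suggested by the `n_cores[1]` usage above, but not confirmed here). The names `n_physical` and `original_n_cores` are illustrative.

```
// Query the core count GAUSS currently uses for threadfor
n_cores = sysstate(42, 0);

// Assumption: 2 logical cores per physical core (hyperthreading)
n_physical = n_cores[1] / 2;

// Cap threadfor at the physical core count and keep the old setting
original_n_cores = sysstate(42, n_physical);

// ... run the threadfor loops here ...

// Assumption: element 1 of the returned value is the previous count,
// so passing it back restores the original setting
call sysstate(42, original_n_cores[1]);
```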

Excessive memory utilization

Creating N threads will increase the memory utilization of that particular part of the algorithm by a little more than N times the original usage. If the operating system has to start swapping to disk, the computation will slow down quite a bit.
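A rough back-of-the-envelope estimate can help judge whether a loop is likely to push the machine into swapping. The sketch below uses a hypothetical per-iteration matrix size and thread count; it relies only on the fact that a GAUSS matrix stores each element as an 8-byte double, and it ignores the extra overhead mentioned above.

```
// Hypothetical workload: each thread holds one n_rows x n_cols matrix
n_rows = 10000;
n_cols = 1000;
n_threads = 36;

// 8 bytes per double-precision element
gb_per_thread = n_rows * n_cols * 8 / 1e9;
gb_total = n_threads * gb_per_thread;

print "Approx. GB per thread: " gb_per_thread;
print "Approx. GB for all threads: " gb_total;
```

If the total approaches the installed memory, reducing the thread count or the per-iteration working set is probably worth trying first.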

Using fn's in the threadfor loop

The manner in which fn's use global variables can cause slowdowns when combined with threadfor's creation of loop temporary variables. Any fn's should be made into proc's.
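The conversion is usually mechanical. Here is a minimal sketch with a hypothetical one-line function `sq_dist` and an equivalent proc `sq_dist_p`; neither name comes from the original program.

```
// Original one-line function (hypothetical example)
fn sq_dist(x, y) = sumc((x - y).^2);

// Equivalent proc: the intermediate value is declared local
proc (1) = sq_dist_p(x, y);
    local d;
    d = x - y;
    retp(sumc(d.^2));
endp;

// Calling the proc version from inside a threadfor loop
n = 100;
out = zeros(n, 1);

threadfor i (1, n, 1);
    out[i] = sq_dist_p(i * ones(3, 1), zeros(3, 1));
threadendfor;
```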

Thread affinity

The operating system is constantly starting threads, pausing them to give other threads time-slices, and restarting them. If the operating system decides to restart a thread on a different core than the one it originally started on, the algorithm will slow down. This slowdown is particularly acute on a machine that is not just multi-core but multi-socket (i.e. multiple physical n-core CPUs). Adjusting the thread affinity settings for your system so that threads are pinned to the cores they started on can also help quite a bit.

aptech


Thanks for the information provided. For my case:

1. `n_cores = sysstate(42,0); print n_cores[1];` gave 72.
2. So I believe that you were suggesting `original_n_cores = sysstate(42,36);`.
3. Last night my simulations got stuck and I had to stop them. Memory usage was around 30% on the machine, which has 192 GB of memory. Strangely enough, at this level of memory usage, the CPU usage of GAUSS dropped to 0%, and I had to kill it through the task manager.
4. I did use a couple of fn's instead of proc's. Will make them into proc's and see how it helps.
5. Can you clarify how to specify "thread affinity"? My machine has two Xeon E5-2699 v3 CPUs on two sockets.

Finally, some of those fine points are not documented in the manual. Would it be possible to compile them into a separate chapter of the manual?

Many thanks again.
