Code optimization for use in high performance/high throughput computing
Our lab group recently purchased a floating network license to use GAUSS on the high performance/high throughput computing system at the University of Arizona. Having tried a few test runs, I found that performance does not improve with an increasing number of CPU cores unless I also increase the number of threads in my code. I tried running code with 10 threads for the main loop and 2 threads in a bootstrapping procedure invoked in each major loop. On an MPI system, there is no improvement when the number of cores exceeds 10. On an SMP system, the same pattern holds: using more than 10 cores does not improve performance. Nor did I see any improvement from using 2 nodes, each with 10 cores.
Moreover, running the same code on the SMP system seems to require more memory per core than on the MPI system. On the MPI system, it is sufficient to assign slightly less than 2 GB of memory to each core. On the SMP system, however, I often received an insufficient memory error when I assigned 2 GB per core, and I had to increase the per-core memory to 4 GB for a successful run. Unfortunately, only a very limited number of nodes in the SMP system have more than 2 GB of memory per core, so increasing the memory assigned to each core means waiting longer for those nodes to become available.
My first question is whether there are other ways to improve the performance of my code by only increasing the number of nodes or cores I use on either the MPI or SMP system, without increasing the number of threads. My second question is what causes the same code to require more memory on the SMP system than on the MPI system. If it always requires more memory on the SMP system, is there any advantage in choosing the SMP system over the MPI system?
How to take advantage of large numbers of cores
Ultimately, to take advantage of a larger number of cores, you will need a larger number of threads. There are two ways that this can be accomplished. The most obvious way of using more threads is by explicitly creating them with threadBegin, threadStat, etc.
The second way to make your GAUSS program use a greater number of threads is (when possible) to change element-by-element operations and loops into vectorized statements. A simple but illustrative example is the dot product of two vectors. You can perform this calculation in GAUSS like this:
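As a minimal sketch of the explicit approach, assuming two calculations that do not depend on each other, the threading keywords might be used as follows. The matrices a and b and the result names are hypothetical and chosen only for illustration. Each thread writes to its own variables, since threads in the same set should not assign to a symbol that another thread reads or assigns.

```
// Hypothetical input data for illustration
a = rndn(500, 500);
b = rndn(500, 500);

// threadStat marks a single statement to run as its own thread
threadStat aa = a'*a;
threadStat bb = b'*b;

// Wait for both threads to finish before using aa and bb
threadJoin;

// threadBegin/threadEnd mark a block of statements to run as one thread
threadBegin;
    tr_a = sumc(diag(aa));
threadEnd;
threadBegin;
    tr_b = sumc(diag(bb));
threadEnd;
threadJoin;

print tr_a + tr_b;
```

The threadJoin statement is the synchronization point: results assigned inside a thread should only be read after it.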
    x = rndn(1e6, 1);
    y = rndn(1e6, 1);
    z = 0;

    //Calculate dot product
    for i(1, rows(x), 1);
        z = z + x[i].*y[i];
    endfor;
or you could simplify it to this:
    x = rndn(1e6, 1);
    y = rndn(1e6, 1);
    z = x'*y;
You could certainly add threading statements to the first code snippet above. If the vectors were long enough, you could get a good speedup from a small number of threads.
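For illustration, one way the loop version of the dot product might be threaded is to split the index range in two and give each half to its own thread. This is only a sketch: the names half, z1 and z2 are hypothetical, and the two-way split is an arbitrary choice. Each thread accumulates into its own variable so that no two threads assign to the same symbol.

```
x = rndn(1e6, 1);
y = rndn(1e6, 1);

n = rows(x);
half = floor(n / 2);    // hypothetical split point
z1 = 0;
z2 = 0;

// First thread: elements 1 through half
threadBegin;
    for i(1, half, 1);
        z1 = z1 + x[i] * y[i];
    endfor;
threadEnd;

// Second thread: elements half+1 through n
threadBegin;
    for j(half + 1, n, 1);
        z2 = z2 + x[j] * y[j];
    endfor;
threadEnd;

// Wait for both partial sums, then combine them
threadJoin;
z = z1 + z2;
```

More threads would mean more partial sums to combine at the end, and more thread-creation overhead relative to the work done by each thread.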
The second code snippet, however, will be automatically threaded by GAUSS “under the hood”. These automatic threads have much less overhead than the GAUSS-level threading commands. They also adjust automatically to the number of cores on your machine, the size of your data, and the calculation being performed.
GAUSS does not place any limit on the number of threads that you can create. As you increase the number of threads for a given problem, however, the overhead of thread creation will become an increasing portion of the overall run time. Therefore, any particular calculation can only scale up to a certain point.
What to do?
Most likely there are some solid performance improvements to be had by vectorizing your code.
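A rough way to check whether vectorizing a particular calculation pays off is to time both versions. The sketch below reuses the dot product example and assumes that hsec (hundredths of a second since midnight) is precise enough for the problem size. The variable names are hypothetical, and the timings will depend on your machine.

```
x = rndn(1e6, 1);
y = rndn(1e6, 1);

// Time the element-by-element loop
t0 = hsec;
z_loop = 0;
for i(1, rows(x), 1);
    z_loop = z_loop + x[i] * y[i];
endfor;
t_loop = (hsec - t0) / 100;

// Time the vectorized statement
t0 = hsec;
z_vec = x'*y;
t_vec = (hsec - t0) / 100;

print "loop seconds:       " t_loop;
print "vectorized seconds: " t_vec;
```

Running the same comparison with different core counts should indicate how much of the speedup comes from the automatic threading of the vectorized statement.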
Why would SMP use more memory than MPI?
Given the number of unknowns, and without having seen the code, it is hard to say. For the most part, GAUSS should not require more memory on a shared memory machine. However, it might have seen more cores available on the shared memory machine and created more internal threads, which would take more memory.
Many thanks for this very helpful response. Unfortunately, there is very little I can change to further vectorize the code. So I have to accept that the only way to further improve performance is to increase the number of threads, which comes at the cost of increased overhead.
I do have another general question. Given the architecture of the GAUSS program, does GAUSS code generally perform better on an SMP system than on an MPI system? So far, I have seen mixed results in my test runs. If GAUSS does indeed perform better on an SMP system, then there is a strong reason for me to choose SMP over MPI despite the need for increased memory.
The GAUSS threading statements use either pthreads or OpenMP (OMP), so they are best suited to shared memory systems.