Loops and multithreading
I’ve been reading about multithreading ability in GAUSS and would like to use it to speed up simple loops, but I’m not sure how to go about it. A simple example would be something like
    n = 100;          // number of times to loop
    y = zeros(n,1);   // holds results
    for j (1,n,1);
        x = rndu(10,1);             // some data to analyze
        y[j,1] = somefunction(x);
    endfor;

Execution could be sped up if the statements in the loop ran as independent threads. This is illegal, however, because all threads would try to write to the same global variable y.
I understand that I could repeat code blocks in the loop, such as
    for j (1,n,1);
        // 1st thread
        x = rndu(10,1);             // some data to analyze
        y1[j,1] = somefunction(x);
        // 2nd thread
        x = rndu(10,1);             // more data to analyze
        y2[j,1] = somefunction(x);
    endfor;
    y = y1|y2;
If you are accessing a normal matrix (i.e., not a string array or similar), you should be able to write to different elements of the same matrix. For performance, however, you are generally best off keeping the data written by different threads some distance apart in memory and giving each thread more work to do.
This is because each CPU on your computer has its own cache memory. Each CPU reads data into its cache in chunks called “cache lines”. When one CPU writes to a cache line, it notifies the other CPUs which cache line(s) it has written to. The other CPUs then treat that cache line as “dirty”, which can require them to reload the data. Loading data is (relative to other CPU operations) very, very slow. This can lead to a phenomenon called “cache thrashing”, in which your threads spend much of their time reloading data written by other threads, and it can make your code very slow.
User-specified GAUSS threads are meant for “coarse parallelization”. GAUSS automatically carries out the finer level of multithreading inside the intrinsic functions.
Since GAUSS automatically threads many functions internally, code that does not use any explicit GAUSS threading statements will still take advantage of multiple cores. For example, a matrix multiplication or linear solve may use 4-8 threads (or more), depending on system resources and the size of the matrix. Therefore, you can use many cores with just a few GAUSS-level threading statements.
In most cases you will be best off creating a smaller number of blocks, like this:

    n = 100;        // number of times to loop
    nthreads = 2;
    y1 = zeros(n/nthreads,1);   // holds results from the 1st thread
    y2 = zeros(n/nthreads,1);   // holds results from the 2nd thread

    threadBegin;
        for j (1,n/nthreads,1);
            x1 = rndu(10,1);    // some data to analyze
            y1[j,1] = somefunction(x1);
        endfor;
    threadEnd;
    threadBegin;
        for j (1,n/nthreads,1);
            x2 = rndu(10,1);    // separate temporary, so no symbol is assigned in both threads
            y2[j,1] = somefunction(x2);
        endfor;
    threadEnd;
    threadJoin;     // wait for both threads to finish before using y1 and y2

    y = y1|y2;
This example shows just two blocks for the sake of explanation, but scaling to four would not be too hard. The copy and paste is, admittedly, not wonderful. However, the code should avoid the memory issues discussed above, and it will use many more than two threads in total once the automatic threading in GAUSS is taken into account.
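One way to cut down on the copy and paste is to move the loop body into a proc and call it once per thread block. The following is only a sketch of that idea scaled to four blocks, not a tested recommendation: runChunk is a hypothetical helper name, and somefunction is defined here as a simple stand-in (the mean of the data) so that the example is self-contained.

```gauss
// Hypothetical stand-in for the somefunction used in the question
proc (1) = somefunction(x);
    retp(meanc(x));
endp;

// Hypothetical helper: runs m iterations and returns an m x 1 vector of results.
// yc and xc are local, so each call works on its own copies.
proc (1) = runChunk(m);
    local yc, xc;
    yc = zeros(m, 1);
    for j (1, m, 1);
        xc = rndu(10, 1);               // some data to analyze
        yc[j, 1] = somefunction(xc);
    endfor;
    retp(yc);
endp;

n = 100;            // total number of iterations
nthreads = 4;
m = n / nthreads;   // iterations per thread; only read inside the threads

threadBegin;
    y1 = runChunk(m);
threadEnd;
threadBegin;
    y2 = runChunk(m);
threadEnd;
threadBegin;
    y3 = runChunk(m);
threadEnd;
threadBegin;
    y4 = runChunk(m);
threadEnd;
threadJoin;         // wait for all four threads before combining the results

y = y1|y2|y3|y4;
```

Each thread block assigns to its own result variable (y1 through y4), and everything else it writes to is local to the proc call, so the threads do not share any data that they write.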
That’s very interesting.
Do you have any practical advice for users seeking to write simple for loops to take advantage of multiple processors? I’m sure problems of this type are frequently encountered by users.
The best practical advice would be this: GAUSS-level threads take some time to create and coordinate. On a recent Linux machine with an Intel quad-core processor, this was timed at about 0.00009 seconds per thread creation. Try to make sure that any code you execute in a separate GAUSS-level thread will take at least 0.01 seconds to run in order to achieve good thread efficiency.
For threading at a finer level than that, the internal GAUSS threads already handle it.
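To check whether a block of work clears the 0.01 second guideline, you can time the loop body in a single thread first. At 0.01 seconds of work, a thread creation cost of about 0.00009 seconds is under 1% of the run time. The sketch below assumes the same stand-in somefunction as before (the mean of the data); nreps and the variable names are arbitrary choices for illustration. Because hsec only resolves hundredths of a second, the body is repeated many times and the total is averaged.

```gauss
// Hypothetical stand-in for the somefunction used in the question
proc (1) = somefunction(x);
    retp(meanc(x));
endp;

nreps = 100000;     // repeat many times because hsec has 0.01 second resolution

t0 = hsec;          // hundredths of a second since midnight
for i (1, nreps, 1);
    x = rndu(10, 1);
    tmp = somefunction(x);
endfor;
elapsed = (hsec - t0) / 100;    // total elapsed time in seconds

secPerIter = elapsed / nreps;
print "seconds per iteration:" secPerIter;

// Rough number of iterations each thread needs to reach 0.01 seconds of work
if secPerIter > 0;
    minIters = ceil(0.01 / secPerIter);
    print "iterations per thread for 0.01 seconds of work:" minIters;
endif;
```

If the number of iterations you can give each thread is well below minIters, the thread overhead is likely to eat up most of the gain, and it is probably better to leave the loop unthreaded and rely on the internal GAUSS threading.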