Loops and multithreading

Question

I've been reading about the multithreading capabilities in GAUSS and would like to use them to speed up simple loops, but I'm not sure how to go about it. A simple example would be something like:

n=100; // number of times to loop
y=zeros(n,1); // holds results
for j (1,n,1);
   x = rndu(10,1); // some data to analyze
   y[j,1] = somefunction(x);
endfor;

Execution could be sped up if the statements in the loop were to run as independent threads. This is illegal, however, as all threads would try to write to the same global variable y.

I understand that I could repeat code blocks in the loop, such as

for j (1,n,1);
   // 1st thread
   x = rndu(10,1); // some data to analyze
   y1[j,1] = somefunction(x);

   // 2nd thread
   x = rndu(10,1); // more data to analyze
   y2[j,1] = somefunction(x);
endfor;

y = y1|y2;

This can be threaded, but it means I need to write out the same block of code for every core on my machine. I have access to a 64-core machine, so I'm hoping for something more elegant.

Any suggestions?

3 Answers

Aptech · Answer 1

If you are accessing a normal matrix (i.e., not a string array, etc.), you should be able to write to different elements of the same matrix. For performance, however, it is generally best to keep the data written by different threads some distance apart and to give each thread more work to do.

This is because each CPU on your computer has separate cache memory. Each CPU reads data into its cache in chunks called "cache lines". When one CPU writes to a cache line, it notifies the other CPUs which cache line(s) it has written to. The other CPUs then consider that cache line "dirty", which can force them to reload the data. Loading data is (relative to other CPU operations) very, very slow. This can lead to a phenomenon called "cache thrashing", in which your threads spend much of their time reloading data written by other threads, and it can make your code very slow.

User-specified GAUSS threads are meant for "coarse parallelization". GAUSS automatically carries out the finer level of multithreading inside the intrinsic functions.

Since GAUSS automatically threads many functions internally, code that does not use any explicit GAUSS threading statements will still take advantage of multiple cores. For example, a matrix multiplication or linear solve may use 4-8 threads (or more), depending on system resources and the size of the matrix. Therefore, you can use many cores with just a few GAUSS-level threading statements.

In most cases you will be best off creating a smaller number of blocks, like this:

n = 100; // total number of iterations
nthreads = 2;
y1 = zeros(n/nthreads,1); // holds results from the 1st thread
y2 = zeros(n/nthreads,1); // holds results from the 2nd thread

threadBegin;
   for j (1,n/nthreads,1);
      x1 = rndu(10,1); // some data to analyze
      y1[j,1] = somefunction(x1);
   endfor;
threadEnd;
threadBegin;
   for j (1,n/nthreads,1);
      x2 = rndu(10,1); // more data to analyze; x2 is assigned only in this thread
      y2[j,1] = somefunction(x2);
   endfor;
threadEnd;
threadJoin; // wait for both threads to finish before using y1 and y2

y = y1|y2;

This example shows just two blocks for the sake of explanation, but scaling to four would not be too hard. The copy and paste is, admittedly, not wonderful. Still, the code should avoid the memory issues discussed above and, given the automatic threading in GAUSS, use many more than two threads.
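One way to cut down on the copy and paste is to move the loop into a procedure and launch each call as a single-statement thread with threadStat. The following is only a sketch under some assumptions: runBlock is a hypothetical helper name, somefunction is stubbed out with a placeholder (the column mean), and n is assumed to divide evenly by the number of threads.

```gauss
// Placeholder for the real analysis (assumption): returns the mean of x.
proc (1) = somefunction(x);
    retp(meanc(x));
endp;

// Hypothetical helper: runs m iterations and returns an m x 1 vector.
// yb and x are locals, so each call works on its own copies.
proc (1) = runBlock(m);
    local yb, x;
    yb = zeros(m, 1);
    for j (1, m, 1);
        x = rndu(10, 1);             // some data to analyze
        yb[j, 1] = somefunction(x);
    endfor;
    retp(yb);
endp;

n = 100;            // total number of iterations
nthreads = 4;
m = n / nthreads;   // iterations per thread

// Each thread assigns only to its own result vector.
threadStat y1 = runBlock(m);
threadStat y2 = runBlock(m);
threadStat y3 = runBlock(m);
threadStat y4 = runBlock(m);
threadJoin;         // wait for all four threads to finish

y = y1|y2|y3|y4;
```

Each extra thread still costs one line here, but the loop body is written only once, and every thread writes to a separate result vector rather than to neighbouring elements of a shared one.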

SvanNorden · Answer 2

That's very interesting.

Do you have any practical advice for users seeking to write simple for loops to take advantage of multiple processors? I'm sure problems of this type are frequently encountered by users.

Aptech · Answer 3

The best practical advice would be: GAUSS-level threads take some time to create and coordinate. On a recent Linux machine with an Intel quad-core processor, thread creation was timed at about 0.00009 seconds per thread. Try to make sure that any code you execute in a separate GAUSS-level thread takes at least 0.01 seconds to run in order to achieve good thread efficiency.

Threading at a finer level than that is already handled by the internal GAUSS threads.
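With the figures quoted above, a block that does 0.01 seconds of work spends roughly 0.00009 / 0.01 = 0.9% of that time on thread creation. To see how many loop iterations it takes to reach that much work, you can time the loop body serially first. This is a rough sketch, not a benchmark: somefunction is again a placeholder, the repetition count is arbitrary, and hsec (hundredths of a second since midnight) is coarse, so the loop is repeated many times to get a measurable total.

```gauss
// Placeholder for the real analysis (assumption): returns the mean of x.
proc (1) = somefunction(x);
    retp(meanc(x));
endp;

reps = 10000;                    // arbitrary; raise it if elapsed comes out as 0
t0 = hsec;
for i (1, reps, 1);
    x = rndu(10, 1);
    tmp = somefunction(x);
endfor;
elapsed = (hsec - t0) / 100;     // seconds (ignores a run that crosses midnight)
secPerIter = elapsed / reps;

print "seconds per iteration:" secPerIter;

if secPerIter > 0;
    // Iterations needed for a thread to do about 0.01 seconds of work.
    minIters = ceil(0.01 / secPerIter);
    print "iterations per thread for 0.01 s of work:" minIters;
endif;
```

If the iteration count this suggests is larger than n divided by the number of threads you planned, using fewer, larger blocks is likely the better choice.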
