GAUSS Threading Tutorial - Part 3

Introduction to Measuring Performance

In a perfect world, the performance of a multithreaded program would always follow this simple formula:

thread_Time = sequential_time/num_CPUs;

However, the real world is more complicated. In this section of the tutorial, we will discuss three concepts that are essential for creating efficient multithreaded programs.

Amdahl's law
Granularity
Load balance

Amdahl's Law

In most cases, not every line of a program is suitable for parallelization. Amdahl's law tells us the upper limit of achievable speedup, given the percentage of a program that can be parallelized and the number of computing cores available. It can be expressed as:

max_speedup = 1/(pct_parallel/num_CPUs + (1 - pct_parallel));

If 75% of our program can be run in parallel, then the theoretical maximum speed-up on four cores is:

max_speedup = 1/(0.75/4 + (1 - 0.75)) = 2.28 = 228%

If 90% of our program can be run in parallel, we can theoretically achieve a 307% speed-up.

We can learn two important lessons from performing this calculation. First, we can gain a realistic perspective when analyzing the performance of multithreaded programs. Second, this illustrates how important it is to run as much of a program as possible in parallel and to write efficient threaded code.

Granularity

Granularity refers to the amount of work that is handed off to each thread. Since the overhead of thread creation is amortized over the life of the thread, fewer threads executing larger chunks of code will have less overhead than a larger number of threads running smaller chunks of code.

Example

As an extreme example, we will compare three versions of the same problem—the first is single-threaded, the second is efficiently threaded, and the third is inefficiently threaded. (To simplify this example, internal threading was turned off for this test run. Internal threading of GAUSS intrinsics can be turned off by setting the environment variable OMP_NUM_THREADS=1.)

//Single threaded
r = 180;
c = 180;

h = hsec;
for(1,1000,1);
  x = rndn(r,c);
  z = olsqr(x,x);
endfor;
(hsec-h)/100 "seconds";

//Efficently threaded
h = hsec;
threadbegin;
   for(1,500,1);
      x = rndn(r,c);
      z = olsqr(x,x);
   endfor;
threadend;
threadbegin;
   for(1,500,1);
      x2 = rndn(r,c);
      z2 = olsqr(x2,x2);
   endfor;
threadend;
threadjoin;
(hsec-h)/100 "seconds";

//Inefficiently threaded
//Small chunks of work are given to each thread
h = hsec;
for(1,500,1);
   threadbegin;
      x = rndn(r,c);
      z = olsqr(x,x);
   threadend;
   threadbegin;
   x2 = rndn(r,c);
   z2 = olsqr(x2,x2);
   threadend;
   threadjoin;
endfor;
(hsec-h)/100 "seconds";

Running this program on a server with 2 quad-core CPU's, produced the following timings:

Single thread 4.67 seconds
Efficiently threaded 2.39 seconds // 195% speed-up
Inefficiently threaded 2.87 seconds // 162% speed-up

The efficiently threaded program splits the for loop into two for loops that are each run in their own thread. The inefficiently threaded program creates a new thread for each iteration of the loop. As we can see from the timings, the latter is significantly slower.

Load Balance

Load balance refers to how evenly the work is distributed to each thread. The effect of distributing unbalanced workloads to parallel threads is that some of your threads will be sitting idle waiting for the threads that were given more work to do.

One case in which this may come up is when a function is called many times in a loop with data that is increasing or decreasing in size. If we use the technique from the second example above of splitting the main loop into two (or more) smaller loops, some threads will have much more work to do than the others.

If the rate of change for the size of the incoming data is known ahead of time, you can simply give a proportionally larger number of iterations to the thread that is processing the smaller amount of data. However, if you are not sure how much the size of the data will change, or for which iterations, you may need to create more threads, giving each thread a smaller number of iterations. This will increase the thread creation overhead, but in many cases, balancing the load will still provide better performance.