GAUSS Threading Tutorial – Part 3

Measuring Performance

In a perfect world, the performance of a multithreaded program would always follow this simple formula:

thread_Time = sequential_Time/num_CPUs;

However, the real world is more complicated. In this section of the tutorial, we will discuss three concepts that are essential for creating efficient multithreaded programs.

  1. Amdahl’s law
  2. Granularity
  3. Load balance

Amdahl’s Law

In most cases, not every line of a program is suitable for parallelization. Amdahl’s law tells us the upper limit of achievable speedup, given the percentage of a program that can be parallelized and the number of computing cores available. It can be expressed as:

max_SpeedUp = 1/(pct_Parallel/num_CPUs + (1 - pct_Parallel));

If 75% of our program can be run in parallel, then the theoretical maximum speed-up on four cores is:

max_SpeedUp = 1/(0.75/4 + (1 - 0.75)) ≈ 2.29 = 229%

If 90% of our program can be run in parallel, we can theoretically achieve a 308% speed-up on the same four cores.

At first glance that may be disappointing. However, remember that this is not a percentage of your total number of lines of code. In many programs, a small minority of the lines of code take up the majority of the execution time.

We can learn two important lessons from performing this calculation. First, we can gain a realistic perspective when analyzing the performance of multithreaded programs. Second, this illustrates how important it is to run as much of a program as possible in parallel and to write efficient threaded code.
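To make the calculation above easy to repeat for different scenarios, it can be wrapped in a small GAUSS procedure (the name maxSpeedUp is illustrative, not a built-in function):

proc (1) = maxSpeedUp(pct_Parallel, num_CPUs);
    // Amdahl's law: pct_Parallel is the fraction of the program
    // that can run in parallel (0-1); num_CPUs is the number of cores
    retp(1/(pct_Parallel/num_CPUs + (1 - pct_Parallel)));
endp;

maxSpeedUp(0.75, 4);    // approximately 2.29
maxSpeedUp(0.90, 4);    // approximately 3.08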

Granularity

Granularity refers to the amount of work that is handed off to each thread. Since the overhead of thread creation is amortized over the life of the thread, fewer threads executing larger chunks of code will have less overhead than a larger number of threads running smaller chunks of code.

Example:

As an extreme example, we will compare three versions of the same problem—the first is single-threaded, the second is efficiently threaded, and the third is inefficiently threaded. (To simplify this example, internal threading was turned off for this test run. Internal threading of GAUSS intrinsics can be turned off by setting the environment variable OMP_NUM_THREADS=1.)

r = 180;
c = 180;

// Single thread
h = hsec;
for i (1, 1000, 1);
   x = rndn(r,c);
   z = olsqr(x,x);
endfor;
(hsec-h)/100 "seconds";

// Efficiently threaded
h = hsec;
threadbegin;
   for i (1, 500, 1);
      x = rndn(r,c);
      z = olsqr(x,x);
   endfor;
threadend;
threadbegin;
   for j (1, 500, 1);
      x2 = rndn(r,c);
      z2 = olsqr(x2,x2);
   endfor;
threadend;
threadjoin;
(hsec-h)/100 "seconds";

// Inefficiently threaded
h = hsec;
for i (1, 500, 1);
   threadbegin;
      x = rndn(r,c);
      z = olsqr(x,x);
   threadend;
   threadbegin;
      x2 = rndn(r,c);
      z2 = olsqr(x2,x2);
   threadend;
   threadjoin;
endfor;
(hsec-h)/100 "seconds";

Running this program on a server with two quad-core CPUs produced the following timings:

Single thread           4.67 seconds
Efficiently threaded    2.39 seconds   // 195% speed-up
Inefficiently threaded  2.87 seconds   // 163% speed-up

The efficiently threaded program splits the for loop into two for loops that are each run in their own thread. The inefficiently threaded program creates a new thread for each iteration of the loop. As we can see from the timings, the latter is significantly slower.

While this is an extreme example, it illustrates the importance of considering granularity in your multi-threaded program.

Load Balance

Load balance refers to how evenly the work is distributed across the threads. When the workloads are unbalanced, some of your threads will sit idle while they wait for the threads that were given more work to finish.

One case in which this may come up is when a function is called many times in a loop with data that is increasing or decreasing in size. If we use the technique from the second example above of splitting the main loop into two (or more) smaller loops, some threads will have much more work to do than the others.

If the incoming data’s rate of change is known ahead of time, you can simply give a proportionally larger number of iterations to the thread that is processing the smaller data. If it is not known, you may need to create more threads, giving each thread a smaller number of iterations. This increases the thread creation overhead, but in many cases the improved load balance more than offsets the extra cost.
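As a sketch of the first approach (the growing matrix sizes and the meanc workload here are purely illustrative), one simple way to balance a loop whose data grows with each iteration is to interleave the iterations across the threads rather than splitting the loop into contiguous halves:

// The matrix grows with each iteration, so the second half of the
// loop is much more expensive than the first. Interleaving the
// iterations gives each thread roughly equal total work.
// Each thread writes only its own variables (x/m vs. x2/m2).
threadbegin;
   for i (1, 999, 2);        // odd iterations
      x = rndn(i, 10);
      m = meanc(x);
   endfor;
threadend;
threadbegin;
   for j (2, 1000, 2);       // even iterations
      x2 = rndn(j, 10);
      m2 = meanc(x2);
   endfor;
threadend;
threadjoin;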