Posted by Jason Polak on 1 June 2018 · Categories: computer-science

This is a continuation of the OpenMP tutorial. Last time we wrote the following program:
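(What follows is a sketch in the same spirit rather than the exact original listing; the variable names and the way the candidate factors are split between the two threads are my own choices.)

```c
#include <math.h>
#include <stdio.h>
#include <omp.h>

int main(void) {
    const unsigned long long N = 770948605618065977ULL;
    /* integer part of the square root of N */
    const unsigned long long root = (unsigned long long) sqrt((double) N);
    int found = 0;

    omp_set_num_threads(2);
    #pragma omp parallel
    {
        /* split the candidate factors by hand: thread 0 takes the lower
           half of the range, thread 1 takes the upper half */
        int id = omp_get_thread_num();
        unsigned long long start = (id == 0) ? 2 : root / 2 + 1;
        unsigned long long end   = (id == 0) ? root / 2 : root;

        for (unsigned long long i = start; i <= end; i++) {
            if (N % i == 0) {
                found = 1;   /* both threads only ever write 1 here */
                break;
            }
        }
    }

    printf("%llu is %s\n", N, found ? "not prime" : "prime");
    return 0;
}
```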

It gives a proof that the number $N = 770948605618065977$ is prime by testing all the possible factors up to the integer part of the square root of $N$. To recap, this program creates two threads and tests half the possible factors on one thread and the other half on the other thread. Wouldn't it be nice to have the program split the loop work up automatically instead of having to do it manually?

That is indeed possible! So let's examine how we do that. Here is a new main function that will do the splitting automatically:
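(Again, a sketch with my own variable names rather than the exact original listing.)

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    const unsigned long long N = 770948605618065977ULL;
    const unsigned long long root = (unsigned long long) sqrt((double) N);
    int found = 0;

    /* OpenMP splits the iterations of the next for loop among the threads */
    #pragma omp parallel for
    for (unsigned long long i = 2; i <= root; i++) {
        if (N % i == 0)
            found = 1;
    }

    printf("%llu is %s\n", N, found ? "not prime" : "prime");
    return 0;
}
```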

There are a couple of things to notice here: we've removed the code-block braces that encapsulated the parallel code. Remember that usually, when we want to run the same code segment in a bunch of spawned threads, we use:
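```c
#pragma omp parallel
{
    /* this block runs once in each spawned thread */
}
```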

However, the new preprocessor directive #pragma omp parallel for means: take the next for loop and run it across all the threads, dividing the work amongst them. It automatically encloses the for loop in a parallel block. Pretty nifty, right? If you run this code, you should get about the same running time as before.

However, notice another thing: I removed the break statement from our for loop. Technically, I knew $N$ was prime, so we didn't need it there. But if we replaced $N$ by a composite number, it would be nice to break out of the loop as soon as we found a factor, to avoid wasting time. However, when you use an omp parallel for directive, you can no longer break out of the loop: OpenMP divides the iterations among the threads up front, so it needs a loop whose iterations all run to completion. Some flexibility is therefore lost with a straightforward omp parallel for. In the case of proving $N$ to be prime, this doesn't matter, because every potential factor up to the square root of $N$ must be tested anyway.

Another Example

Let's try another example of using automatic worksharing in a for loop. Let's sum the numbers from 1 to 100:
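(A first attempt, as a sketch; note that the sum variable gets no special treatment yet.)

```c
#include <stdio.h>

int main(void) {
    int sum = 0;

    #pragma omp parallel for
    for (int i = 1; i <= 100; i++) {
        sum += i;
    }

    printf("sum = %d\n", sum);
    return 0;
}
```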

Of course, why would we calculate the sum this way? We know that there's a better way: $(1 + 100)\cdot 100/2 = 5050$. No need for a program at all. But this is just an example, so let's do it anyway. Let's recall how to compile this program, assuming it is in a file called sum.c:
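Assuming gcc (adjust for your compiler), the -fopenmp flag turns on OpenMP support:

```
gcc -fopenmp sum.c -o sum
```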

Now we've got a shiny new binary called sum. Let's run it:
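One run (the exact number varies) might print:

```
$ ./sum
sum = 2372
```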

Oh my goodness! What the devil is going on here? That's not the right answer! Moreover, try running this program a bunch of different times. I get numbers like 2372, 3218, … looks like a giant HOLE in the logic of the universe! Quantum annihilation!

Just joking. What's happening here is a data race: the sum variable is just one variable declared outside the parallel block, and it is shared by all the threads. The statement sum += i is not a single indivisible step, so when several threads update sum at the same time, some of their updates overwrite each other and most of the additions are lost. That's kind of weird if you think about it, but that's the way parallel computing works: unsynchronized updates to shared data give unpredictable results.

What you really want to do is take a local sum variable from each thread and combine them all together at the end. This is called reduction, in parallel computing terms. There is a special syntax and the corresponding new main function goes like this:
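(The same sketch as before, with only the reduction clause added.)

```c
#include <stdio.h>

int main(void) {
    int sum = 0;

    /* reduction(+:sum): each thread adds into its own private copy of sum,
       and the copies are added together when the loop finishes */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 100; i++) {
        sum += i;
    }

    printf("sum = %d\n", sum);   /* now reliably prints 5050 */
    return 0;
}
```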

It's simple: after the parallel for in the pragma, append
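```c
reduction(+:sum)
```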

The keyword reduction means take all the local copies. Local copies of what? The stuff in parentheses is what. (+:sum) says combine all the local copies (with addition) of all the variables in the list following the colon. In this case, we just want the sum variable. If you need more local copies added up, you would use
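```c
reduction(+:sum,sum2)
```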

This would add up all the local copies per thread for the variables sum and sum2 (our program didn't have a sum2 variable, but if it did, this is what we'd use). Here is an example to calculate a product:
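(A sketch computing ten factorial; the variable name prod is my own choice.)

```c
#include <stdio.h>

int main(void) {
    long prod = 1;

    /* reduction(*:prod): each thread multiplies into its own copy of prod,
       and the copies are multiplied together at the end */
    #pragma omp parallel for reduction(*:prod)
    for (int i = 1; i <= 10; i++) {
        prod *= i;
    }

    printf("10! = %ld\n", prod);   /* 3628800 */
    return 0;
}
```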

Again, without the reduction, the correct product may not be calculated. However, if you are taking the product over a very small set of elements (say, calculating the factorial of ten), the loop is so short that you may well get the right answer anyway. Obviously, in the real world, you'd be doing much larger computations.

Finally, if you happen to need a bunch of thread-copies of variables collected under different operations, the syntax is like this:
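(Here sum and prod are just the example variables from above; you attach one reduction clause per operator.)

```c
#pragma omp parallel for reduction(+:sum) reduction(*:prod)
```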

In general, the operators you can use are:

  1. Arithmetic ones: +, *, -
  2. Bitwise ones: &, |, ^
  3. Logical ones: &&, ||

Summary of Lesson

In short:

  1. To parallelize for loops, precede them by #pragma omp parallel for.
  2. If you are tallying any of the results into a variable, use a reduction clause.
  3. More flexibility can still be obtained with the usual #pragma omp parallel block and obtaining thread IDs manually. In that case you will probably need the functions omp_get_thread_num() and omp_get_num_threads(), declared in omp.h.
