This is a continuation of the OpenMP tutorial. Last time we wrote the following program:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
#include <stdio.h> #include <math.h> #include <omp.h> int main(int argc, char** argv){ long int N = 770948605618065977; printf("Number to be tested: %ld\n", N); long int upper = (long int) sqrt(N); omp_set_num_threads(2); #pragma omp parallel { int ID = omp_get_thread_num(); int start,stop; if (ID == 0){ start = 2; stop = 439018395; } else{ start = 439018396; stop = upper; } for (long int j = start; j <= stop; j++){ if (N%j == 0){ printf("Factor found: %ld\n", j); break; } } } return(0); } |

It gives a proof that the number $N = 770948605618065977$ is prime by testing all the possible factors up to the integer part of the square root of $N$. To recap, this program creates two threads and tests half the possible factors on one thread, and half the possible factors on another thread. Wouldn't it be nice to have the program split the loop work up automatically instead of having to do it manually?

That is indeed possible! So let's examine how we do that. Here is a new **main** function that will do the splitting automatically:

1 2 3 4 5 6 7 8 9 10 11 12 13 |
int main(int argc, char** argv){ long int N = 770948605618065977; printf("Number to be tested: %ld\n", N); long int upper = (long int) sqrt(N); omp_set_num_threads(2); #pragma omp parallel for for (long int j = 2; j <= upper; j++){ if (N%j == 0){ printf("Factor found: %ld\n", j); } } return(0); } |

There are a couple of things to notice here: we've removed the code-block parentheses that encapsulated the parallel code. Remember, that usually when we want to run the same code segment in a bunch of spawned threads, we use:

1 2 3 4 |
#pragma omp parallel { //code to be executed in all threads goes here } |

However, the new preprocessor directive **#pragma omp parallel for** means: take the next for loop and run that in all threads, dividing the work amongst each threads. It automatically encloses the **for** loop into a parallel block. Pretty nifty, right? If you run this code, you should get about the same running time as before.

However, notice another thing: I removed the **break** statement from our for loop. Technically, I *knew* $N$ was prime so we didn't need to put that there. But if we replaced $N$ by a composite number, it would be nice to **break** out of the loop if we did find a factor, to avoid wasting time. However, when you just write an **omp parallel for** directive, you can no longer just break out of a loop split amongst threads because of the way OpenMP does that automatically. Therefore, there is some flexibility lost if you just use a straightforward **omp parallel for**. In the case of proving $N$ to be prime, this doesn't matter because *every* factor up to the square root of $N$ must be tested.

## Another Example

Let's try another example of using automatic worksharing in a for loop. Let's sum the numbers from 1 to 100:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
#include <stdio.h> #include <omp.h> int main(int argc, char** argv){ int max = 100; int sum = 0; omp_set_num_threads(2); #pragma omp parallel for for (int j = 1; j <= max; j++){ sum += j; } printf("The sum of the numbers from 1 to %d is %d.\n ", max, sum); return(0); } |

Of course why would we calculate the sum this way? We know that there's a better way to calculate this sum: $(1 + 100)\cdot 100/2 = 5050$. No need to use a program. But after all, this is just an example, so let's do it anyway. Let's recall how to compile this program, assuming it is in a file called **sum.c**:

1 |
gcc sum.c -o sum -fopenmp |

Now we've got a shiny new binary called **sum**. Let's run it:

1 2 |
./sum The sum of the numbers from 1 to 100 is 3218. |

Oh my goodness! What the devil is going on here? That's not the right answer! Moreover, try running this program a bunch of different times. I get numbers like 2372, 3218, … looks like a giant HOLE in the logic of the universe! Quantum annihilation!

Just joking. What's happening here is that in the loop, the **sum** variable in different threads has different local copies and you're just accessing one of them at the end. That's kind of weird if you think about it, because the **sum** variable is just one variable declared outside the parallel code block. However, that's the way parallel computing works.

What you really want to do is take a local **sum** variable from each thread and combine them all together at the end. This is called *reduction*, in parallel computing terms. There is a special syntax and the corresponding new **main** function goes like this:

1 2 3 4 5 6 7 8 9 10 11 |
int main(int argc, char** argv){ int max = 100; int sum = 0; omp_set_num_threads(2); #pragma omp parallel for reduction (+:sum) for (int j = 1; j <= max; j++){ sum += j; } printf("The sum of the numbers from 1 to %d is %d.\n ", max, sum); return(0); } |

It's simple: after the **parallel for** in the **pragma**, append

1 |
reduction (+:sum) |

The keyword **reduction** means take all the local copies. Local copies of what? The stuff in parentheses is what. **(+:sum)** says combine all the local copies (with addition) of all the variables in the list following the colon. In this case, we just want the sum variable. If you need more local copies added up, you would use

1 |
reduction (+:sum,sum2) |

This would add up all the local copies per thread for the variables **sum** and **sum2** (our program didn't have a **sum2** variable, but if it did, this is what we'd use). Here is an example to calculate a product:

1 2 3 4 5 6 7 8 9 10 11 |
int main(int argc, char** argv){ int max = 100; double product = 1; omp_set_num_threads(2); #pragma omp parallel for reduction (*:product) for (int j = 1; j <= max; j++){ product = product*(40/ (double) j); } printf("Their product is %f.\n", product); return(0); } |

Again, without the reduction, the correct product may not be calculated. However, if you are taking the product over a very small set of elements (say calculating the factorial of ten), then the result will *probably* all be executed in a single thread. Obviously, in the real world, you'd be doing much larger computations.

Finally, if you happen to need a bunch of thread-copies of variables collected under different operations, the syntax is like this:

1 |
#pragma omp parallel for reduction (+:var1,var2) reduction product (*:var3,var4) |

In general, the operators you can use are:

- Arithmetic ones: +,*,-
- Logical ones: &, |, ^, &&, ||

## Summary of Lesson

In short:

- To parallelize for loops, precede them by:

1#pragma omp parallel for - If tallying any of the results into a variable, use reductions
- More flexibility can still be obtained with the usual
**#pragma omp parallel**block and obtaining thread ID's manually. In this case you probably will need the following functions defined in omp.h:**omp_get_thread_num();**and**omp_get_num_threads();**