 ### High performance computation of Pi

In this lab, we will write progressively more efficient computation of Pi (yes, that 3.14.... constant). First, the idea of approximating the value of Pi is explained in these two videos, video 1 and video 2.
Please remember that it is not necessary to know calculus for completing this lab, but it will be helpful if you follow the explanations in the video.

The sequential code for approximating Pi is here.

You can find the first 100,000 digits of Pi from this web page. You should compare your output with these digits to see how accurately your program is approximating Pi.

• Understand the code by listening to the videos.
• Time the program with increasing values of numSteps and check how accurately your program is approximating Pi.
• Using OpenMP, parallelize your code using a default number of threads and after that by increasing the number of threads. First use just
#pragma omp parallel
Here is an attempt to parallelize the sequential code. Execute this code and see the result. We should expect each thread will compute the correct value of Pi and print it. Does it give the correct result for each thread? What is going wrong? Remember all variables in this code are shared by all threads. What should you change to correct this code so that each thread prints better approximations of Pi?
• Now notice the time taken by the sequential program and your correct multi-threaded program. You may have to use a high value of numSteps for getting non-zero time. Why is the multi-threaded program taking longer, even though each thread is executing independently in parallel? The answer is related to the cache coherency problem, or false sharing problem that we discussed in the first lecture. Try to understand this
• Now we will parallelize the for loop. The skeleton of the code will look like this:
#pragma omp parallel
#pragma omp for
Of course you have to parallelize the for loop from the correct code in the previous step. There are two important points to remember:
1. Now each thread will compute a slice of the for loop. Hence, you have to allocate an array called sum[Number_of_threads], where Number_of_threads is the number of threads you have used
2. Only one thread should add the elements of this array sum at the end of the for loop. This can be done by using just one thread to compute this final sum. School of Computer Science & Software Engineering The University of Western Australia Crawley, Western Australia, 6009. Phone: +61 8 9380 2716 - Fax: +61 8 9380 1089. CRICOS provider code 00126G 