Notes for CS 670.
--------------------

* Final exam
  + Tuesday, April 30, 10am-noon
* Final exam topics
  + Be able to write code in OpenMP and MPI.
  + Be able to answer questions about the code and logic of pthreads.
  + General questions about || programming - run time, synchronization issues, deadlock, etc.
  + Chapters in the PPM book: 1, 2, 4, 7, 8.
  + Algorithms: prime sieve, prime number tests, merge sort, solving a linear system.
  + Questions from exam1, exam2, and the homeworks.
* Final exam format
  + Each of the following is worth 1/3 of the points.
  + Finish making nbody.cpp MPI, due before the exam starts.
    - NO HELP ALLOWED. NO SEARCHING THE INTERNET ALLOWED. You may only search online for
      what the MPI functions are. You should not be searching for MPI code that solves the
      n-body problem. First, it would be obvious to me if you found that. And second, you
      need to finish the code from class - not come up with your own.
  + On-paper questions.
  + A single on-computer question: some type of computational problem where your score
    depends on how fast your solution is. A single-threaded solution is worth something,
    but a correct and fast multi-threaded or MPI version will be worth more.
* Final exam scoring
  + Either worth what the syllabus says (exam1=10% of total grade, exam2=10%, final=15%),
  + or the final is worth the entire exam part (exam1=0%, exam2=0%, final=35%).
  + Whichever is better for you.
* Project grading
  + Presentation - we will schedule a time late next week for you to show each other your
    projects. It is for fun and not worth any points per se.
  + Final code - must be (1) commented nicely, (2) organized nicely, (3) correct, and
    (4) parallel. If the code is not correct, expect a failing grade for the project. If it
    is correct but not parallel, expect at most a C. If it is correct and parallel but not
    nicely commented or organized, expect at most a B.
* Notes for today
  + Solve linear equations.
    - Fix that issue with the barrier slowing things down?
  + Examples left to look at:
    - Monte Carlo ___ - MPI
    - FFT - OpenMP, OpenCL
  + Maybe mention briefly, if at all:
    - parallel programming in Python, R, Hadoop
  + Why a big cluster?
    - Lots of data spread over many nodes, many small requests (e.g., a web server), or
      lots of calculation (e.g., factoring).
* For you to do
* For Jeff to do
  + integrateOMP fix, some other Monte Carlo...
* Previous to-do items and notes...
  + Exam: on the computer, solving a simple MPI program. There may be one with a bug in it
    to fix.
  + Practice - fix mergeBitonicMPI.cpp so it sorts the numbers first, meaning it will be
    correct even for one node. Also, make it faster...
  + Practice problem #1 - random numbers: compute how many are > 2^30 and how many are
    smaller. (A sketch appears after this list.)
  + Practice problems:
    - Given an adjacency matrix,
      + count how many edges are in the graph,
      + find the vertex of highest degree,
      + decide whether the graph is connected.
    - Given a matrix,
      + determine whether it is in upper-triangular form,
      + compute the square of the matrix (the matrix times itself).
    - Given a list of numbers,
      + determine the max, min, total, average, and variance,
      + determine the same but only for subsets of the numbers (e.g., positive, even, prime, ...),
      + count the number of ___ (primes, evens, ...),
      + sort the list.
    - Given a number n,
      + determine whether 2^n - 1 is prime,
      + determine the number of primes at most n (assume n is at most, say, 100 million -
        get it to work first for even smaller n).
    - Statistics:
      + Have each process run the command uptime and tabulate the results on the root node.
        - Sort the nodes by increasing uptime.
        - See uptimeExample.cpp.
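  + A minimal MPI sketch of practice problem #1, just for reference. The file name
    countBig.cpp, the per-process count N, and the use of mt19937 are my own choices,
    not something from class.

        // countBig.cpp - each process generates N random numbers, counts how many are
        // > 2^30 and how many are smaller, and MPI_Reduce combines the counts on rank 0.
        #include <mpi.h>
        #include <cstdio>
        #include <random>

        int main(int argc, char *argv[]) {
            MPI_Init(&argc, &argv);

            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            const long N = 10000000;                   // numbers per process (my choice)
            std::mt19937 gen(rank + 1);                // different seed on each process
            std::uniform_int_distribution<long> dist(0, (1L << 31) - 1);

            long localBig = 0, localSmall = 0;
            for (long i = 0; i < N; i++) {
                if (dist(gen) > (1L << 30)) localBig++;
                else                        localSmall++;
            }

            long totalBig = 0, totalSmall = 0;
            MPI_Reduce(&localBig,   &totalBig,   1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
            MPI_Reduce(&localSmall, &totalSmall, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

            if (rank == 0)
                printf("> 2^30: %ld   smaller: %ld   total: %ld\n",
                       totalBig, totalSmall, (long)size * N);

            MPI_Finalize();
            return 0;
        }

    Compile with the MPI wrapper compiler and run with mpiexec as usual. Each process counts
    independently, so the only communication is the two reduces at the end.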
  + Figure out your project and come talk to me about it...
  + Send me your code that works (FFT, quicksort, Fox, factoring) so I can put it in the
    class code.
  + Pick a project - see project.txt.
  + Do Fox yourself on, say, a 4x4 or 6x6 matrix... Verify you get the correct answers in
    the end.
  + For Fox, how much message traffic is there for nxn matrices?
  + Do the mergeSort running-time calculations yourself and make sure you get the same
    answer as Jeff.
  + With mergeSort, why can't you parallelize the for loop in the merge function?
  + mergeSort with OpenMP.
  + Talk about thread-safe versus reentrant.
  + Make sure students can compile/run MPI programs.
  + Review the perfect-number code from class.
  + hw3 - Lucas-Lehmer test.
  + Look at factorMPI.c.
  + Make the Miller-Rabin test parallel as well.
  + Make the whole thing MPI.
  + Next time - go over MPI.
  + Fix the compile error on sorting.
  + Read the chapters on message passing and MPI in the PPM book.
  + Read about the Fermat test and the Miller-Rabin test on Wikipedia.
  + Comment your code!!! So you explain what it is doing, how it works, etc.
  + What was wrong with factorOpenMP2b.c - one thing is that sqrtn is a double but I was
    printing it as an int (so it was printing nonsense).
  + Make the factoring example actually factor.
  + Look at a few different hw1's and see the problems caused by not locking updates to
    shared variables.
    - Also, does it matter whether you use parallel on nested for loops or not? Do a timing
      experiment.
  + factorOpenMP4.c - why is it stopping with "left over (prime or 1) is 25"? All the other
    threads had finished already, and there was just one left over that found the factor.
    Then that thread set up its new min and max, which didn't happen to contain 5. Solve
    this problem by reorganizing and making recursive calls any time we find a non-trivial
    factor.
  + Get down to prime factors (recursive...).
  + GMP

* Exercises (you should do these on your own; I won't grade them).

*** OpenMP ***

Pragmas...
* #pragma omp parallel makes the next statement execute in parallel on however many
  threads are being used.
  - Normally make the next statement a compound statement.
  - Local variables declared inside that compound statement are private to each thread.
  - Global variables and locals declared before the #pragma are shared between threads
    (so using them may require locking/synchronization).
  - Can specify private(x) to make a private copy of x for each thread, though it will not
    be initialized to anything.
  - Use firstprivate(x) to initialize each private x to the value it had before the
    #pragma omp parallel.
  - lastprivate(x) copies back, after the region, the value x had in the sequentially last
    iteration (or last section).
* #pragma omp parallel for - splits the next for loop up between the threads.
* #pragma omp critical - the next compound statement is a critical section, meaning that if
  one thread is running that code, no other thread will be running it at the same time.
  - Can specify an optional name (#pragma omp critical blah); critical sections that share
    the same name exclude each other, so separate segments of code can protect the same
    shared data.
* #pragma omp sections, and #pragma omp section - specify multiple sections of code; each
  section runs sequentially, but the different sections can run in parallel.
* #pragma omp single - the next statement is run by only one of the threads.
* #pragma omp barrier - no thread goes past this point until all of them are ready to.
* #pragma omp for reduction(+:z) - each thread keeps a private copy of z during the loop,
  and the copies are added together when the loop finishes (e.g., adding up all the terms
  in an array). Can also use *, -, &, ... (A sketch using this appears after this list.)
* #pragma omp atomic - the next statement is an update to a shared variable (e.g., x += 2),
  and the update is made atomic (can be more efficient than using a critical section or
  locks).
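Here is a minimal sketch tying a few of these pragmas together (the array, its size, and
the thread count are my own choices): summing an array with a reduction versus locking
every update with a critical section.

    // Sum an array two ways.  The reduction keeps a private partial sum per thread and
    // combines them at the end; the critical version locks every single update, which is
    // correct but much slower.
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 10000000;
        std::vector<double> a(n, 1.0);

        omp_set_num_threads(4);                 // otherwise a default is chosen

        double start = omp_get_wtime();
        double sum1 = 0.0;
        #pragma omp parallel for reduction(+:sum1)
        for (int i = 0; i < n; i++)
            sum1 += a[i];
        double t1 = omp_get_wtime() - start;

        start = omp_get_wtime();
        double sum2 = 0.0;
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            #pragma omp critical
            sum2 += a[i];
        }
        double t2 = omp_get_wtime() - start;

        printf("reduction: %.0f in %f s    critical: %.0f in %f s\n", sum1, t1, sum2, t2);
        return 0;
    }

Compile with -fopenmp (g++ blah.cpp -fopenmp). Both sums should come out to 10000000, with
the reduction version far faster.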
Library calls...
* omp_set_num_threads - sets the number of threads to use; otherwise an "optimal" default
  value will be chosen.
* omp_get_thread_num - gets the calling thread's id, used to identify which thread is
  running.
* omp_get_wtime - gets "wall" time. Use it for timing execution.

Locks
* omp_init_lock - initialize/create a lock variable.
* omp_destroy_lock - destroy it.
* omp_set_lock - get control of the lock variable (waiting if needed).
* omp_unset_lock - give up control of it.
* omp_test_lock - try to get control of the lock; returns 1 on success, 0 if it is already
  locked.
* The same functions exist with "nest" (e.g., omp_init_nest_lock) for locks that can be set
  multiple times in a row by the same thread...

*** Sage ***

Sage is a program similar to Maple and Mathematica. It is installed on CS, and I think on
the x-lab and y-lab computers. You start it by just typing "sage" in the shell. That starts
up a text-based version of Sage, which you can run in putty. You can then type Sage
commands. Search online for some examples, e.g., "sage prime".
* Finding a prime.
  - Type "next_prime(10**10)" to find the next prime after 10^10.
* Factoring.
  - Type "factor(12345)" to factor 12345.
  - Sage seems to factor numbers up to 40-50 digits within a second. To test this, try
    factor(next_prime(10**20)*next_prime(10**22)).

*** Notes on good || programming ***

* Remember to lock updates to shared variables.
* Dividing work up between threads...
  + In OpenMP, it may be good to put #pragma omp parallel for on inner loops as well as the
    outer loop. The reason is that some of the outer-loop threads may finish early, and then
    some cores would sit idle while the remaining threads finish.
* Avoiding updates to shared variables.
  + OpenMP for reduction - instead of locking on each update, keep a local copy of the sum
    in each thread, then add them up at the end. In general, if we can keep local variables
    and just combine them at the end, that can be faster.
* For any library used in multi-threaded code, make sure it is thread safe. If it is not,
  only call it inside critical sections.
* When using OpenMP, be careful not to let it create too many threads. For example, if
  doing quickSort recursively where each recursive call is a thread, stop creating new
  threads past a certain recursion depth.

*** Problems to Solve ***

* Factoring - algorithms to try
  + trial division
  + Pollard's rho algorithm
  + Dixon's algorithm
  + quadratic sieve
* n-Queens
* NP-hard problems
  - satisfiability
  - sudoku solving
  - TSP
* Graph problems
  - shortest path
  - spanning tree

*** Perfect Numbers ***

* 28 = 2*2*7. All factors = 2, 7, 2*7, 2*2 (plus 1, and together they sum to 28).
* 3*3*5*7*11*11. All factors = 3, 5, 7, 11, 3*3, 3*5, 3*7, 3*11, ...
* So, for a naive perfect-number test, instead of trying all numbers from 1 up to n/2, just
  factor n and then compute "all factors" from the prime factorization. (A sketch of this
  follows this section.)
* But, of course, testing whether 2^p - 1 is prime will be faster than that.
* Note: for testing whether 2^p - 1 is prime, we can do better than just trial division.
  See: http://en.wikipedia.org/wiki/Primality_test
  + For trial division, if the number p being tested is n bits long, it takes about
    2^(n/2) = sqrt(p) trial divisions.
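A minimal sketch of that "factor first" test (the bounds and names are my own). Instead of
listing every divisor, it uses the standard divisor-sum formula: the sum of all divisors is
the product, over each prime power p^e in the factorization, of 1 + p + ... + p^e, and n is
perfect exactly when that sum equals 2n.

    #include <cstdio>

    // Factor n by trial division and accumulate the divisor sum sigma(n) as we go.
    bool isPerfect(long n) {
        long m = n, sigma = 1;
        for (long p = 2; p * p <= m; p++) {
            if (m % p != 0) continue;
            long term = 1, pk = 1;
            while (m % p == 0) {          // pull out every power of p
                m /= p;
                pk *= p;
                term += pk;               // term = 1 + p + p^2 + ... + p^e
            }
            sigma *= term;
        }
        if (m > 1) sigma *= (1 + m);      // leftover prime factor
        return n > 1 && sigma == 2 * n;   // proper divisors sum to n
    }

    int main() {
        for (long n = 2; n <= 10000; n++)
            if (isPerfect(n)) printf("%ld is perfect\n", n);   // prints 6, 28, 496, 8128
        return 0;
    }

The outer loop over n is exactly the kind of independent work that an omp parallel for (or
splitting the range between MPI processes) handles well.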
*** MPI ***

* Basic idea: splitting work between different processes, probably on various computers
  (potentially even different kinds of computers).
  + Difference from shared-memory OpenMP - there you could declare something as a global
    variable and use it to pass information between threads.
    - That also created the problem of needing atomic/critical sections when updating
      shared variables.
  + MPI is a message-passing paradigm - you share data by sending and receiving messages.
* Basic MPI functions...
  + MPI_Init - sets up the MPI framework so you can send messages. Always done at the
    beginning of the program.
  + MPI_Finalize - closes down MPI; always done right at the end.
  + MPI_Comm_size - gets the number of processes.
  + MPI_Comm_rank - my id (some number between 0 and comm_size - 1).
  + MPI_Send
  + MPI_Recv
    - There are various versions of these:
      * blocking versus non-blocking
      * some others as well...
  + MPI_Reduce - every task sends a value, and the values are combined in some way - min,
    max, sum, ...
  + MPI_Barrier - everyone waits until all processes reach the same point.
  + MPI_Bcast - send a message to everyone.
  + MPI_Scatter - divide data up between all the tasks.
  + MPI_Gather - data from all the tasks comes to one task.
  + ...
  + MPI_Wtime - "wall"/elapsed time in seconds.
* Compiling/running
  + mpicc filename.cpp
  + mpiexec -configfile myConfigFile

** Thread Safe/Reentrant **

+ Thread safe - multiple threads can be using the function at the same time and it's okay.
  Basically, each call of the function either uses only local variables or does appropriate
  locking on shared variables. So strlen is thread safe - all it uses is the parameter that
  is passed in.
+ Reentrant - is it okay to call the function from an interrupt service routine?
  - isr - a function that gets called when an interrupt happens, that is, when some
    hardware/data is ready to be processed.
  - See http://en.wikipedia.org/wiki/Reentrancy_%28computing%29
  - Example trace (swap uses a t that is shared between calls):
      main:  swap(&x, &y):  t = 1;  x = 2;            <- interrupt happens here
      isr:   x = 1;  y = 2;  swap(&x, &y):  t = 1;  x = 2;  y = 1
      main:  (swap resumes)  y = 1
  - Is that just the same as if isr had been run twice back to back?
  - To do - what's up with that example?

** Pthread **

+ Shared memory like OpenMP, but with separate functions running, like MPI.
+ Initially one process is running; it can then create new threads. Once a new thread is
  created, it runs until pthread_exit.
+ Basic functions
  * pthread_create - creates a new thread; the thread starts running some function.
  * pthread_attr_init
  * pthread_join - waits for a given thread to finish.
  * pthread_exit - each thread must call this when done, including the "main" thread.
  * pthread_mutex_init - create a mutex variable.
  * pthread_mutex_lock - get control of a mutex variable.
  * pthread_mutex_unlock - release control.

** Sorting ... **

* Another way to think about mergesort (bottom-up, level by level):
      5 1 6 3 2 10 8 7
      5 1 | 6 3 | 2 10 | 8 7      (pair the numbers up)
      1 5 | 3 6 | 2 10 | 7 8      (sort each pair)
      1 3 5 6 | 2 7 8 10          (merge pairs of runs)
      1 2 3 5 6 7 8 10            (final merge)
* MergeSort hw...
  + For for loops where what happens for one value of i depends on what happened for
    previous values of i, you should not parallelize. Examples:
    - the for loop in merge in mergeSort.cpp,
    - the outer for loop in mergeSort in mergeSort.cpp.
  + For loops where each value of i is independent of the rest are okay to parallelize.
    Examples:
    - the inner for loop in the mergeSort function in mergeSort.cpp,
    - the trial-division for loop in factoring.
  + Two ways to make mergeSort.cpp parallel (a sketch of the second appears after this
    section):
    - (1) make the inner for loop in the mergeSort function parallel;
    - (2) in main, split into multiple threads (with the parallel pragma). Each thread
      allocates random numbers and sorts them, and then everything is merged back together
      at the end.
  + Note - OpenMP and MPI together seems faster than just MPI. But then again, that's just
    the way we had it set up...
  + Example of running times...
    (a) 16 MPI processes, each running OpenMP to sort 4000000 numbers.
      * Mergesort is n*log(n). 4000000 * log(4000000) is about 22*4000000, but using 4
        cores, the running time there is about 22*1000000. The 16 sorts all happen in
        parallel, so the elapsed wall time is about 22*1000000.
      * Next those 16 pieces have to be merged, and that happens on one process.
        - Merging two 4-million pieces into an 8-million piece takes roughly 8 million steps.
        - Doing that for 8 pairs is about 64 million - in other words, one level of the
          merging takes about n = 64 million.
        - Altogether there are 4 "levels" of merging, each about n = 64 million, so add on
          another 4*64 million. Total: 22 million + 256 million = 278 million.
      * Oops - that final merging is actually in parallel over 4 cores, because it runs
        inside one of the OpenMP processes.
        - When merging 16 or 8 pieces, divide the time by 4.
        - When merging 4 pieces, divide the time by 2.
        - When merging 2 pieces, divide by nothing (merging 2 has to be single-threaded).
    (b) 64 MPI processes, each running single-threaded to sort 1 million numbers.
      * Each of those 64 sorts takes about 1million*log(1million) = about 20*1million.
      * Those all happen at the same time.
      * At the end, merge together 64 pieces of 1 million each.
      * There will be log(64) = 6 levels of merging before everything is combined.
      * Each level of the merging looks at every number, so 64 million per level.
      * So the final all-merge takes 6*64million = 384 million.
      * Altogether, about 404 million.
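A minimal standalone sketch of approach (2) - this is my own reconstruction, not the class's
mergeSort.cpp, and it uses std::sort for the per-chunk sort to keep it short. Each OpenMP
thread sorts its own chunk of a shared array (disjoint ranges, so no locking is needed),
then the sorted chunks are merged pairwise.

    #include <omp.h>
    #include <algorithm>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    int main() {
        const int n = 1 << 22;                 // 4 million numbers
        const int nthreads = 4;                // 4 chunks, one per thread (my choice)
        std::vector<int> a(n);
        for (int i = 0; i < n; i++) a[i] = rand();

        double start = omp_get_wtime();

        // Each thread sorts one contiguous chunk of the shared array.
        #pragma omp parallel num_threads(nthreads)
        {
            int id = omp_get_thread_num();
            std::sort(a.begin() + n / nthreads * id,
                      a.begin() + n / nthreads * (id + 1));
        }

        // Merge the sorted chunks pairwise: 4 runs -> 2 runs -> 1 sorted array.
        for (int width = n / nthreads; width < n; width *= 2)
            for (int lo = 0; lo + width < n; lo += 2 * width)
                std::inplace_merge(a.begin() + lo,
                                   a.begin() + lo + width,
                                   a.begin() + std::min(lo + 2 * width, n));

        printf("sorted: %s   time: %f s\n",
               std::is_sorted(a.begin(), a.end()) ? "yes" : "no",
               omp_get_wtime() - start);
        return 0;
    }

The merging at the end is still sequential here; parallelizing those merge levels is exactly
what the running-time example above is reasoning about.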
** GDB **

+ The GNU debugger.
+ When you run gcc or g++, add -g to the command to include debugging information:
    g++ blah.cpp -g
+ Run gdb with your executable name to load it into the debugger:
    gdb a.out
+ Once inside gdb, there are some commands...
  - run - starts the program:
      run [command line parameters]
  - backtrace - shows the call sequence.
  - print - prints variables:
      print x
      print x[10]
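A tiny demo to practice those commands on (the file name crash.cpp and the bug are my own):

    // crash.cpp - dereferences a null pointer so there is something for gdb to find.
    //
    //   g++ crash.cpp -g
    //   gdb a.out
    //   (gdb) run          <- the program crashes inside store()
    //   (gdb) backtrace    <- shows that main() called store(), and the faulting line
    //   (gdb) print p      <- prints 0x0, the bad pointer
    #include <cstdio>

    void store(int *p, int value) {
        *p = value;                  // crashes when p is NULL
    }

    int main() {
        int *p = nullptr;            // bug: p was never pointed at real memory
        store(p, 42);
        printf("%d\n", *p);
        return 0;
    }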