Parallel Algorithms (WISM 459), 2011/2012


Teacher

Rob Bisseling

Time and place

Every Wednesday from 10.15-13.00 hour at Utrecht University, campus De Uithof. Location of lectures: room 611 Mathematics building. First lecture: September 7, 2011. Last lecture: December 14, 2011. Each session consists of 2 times 45 min. lectures and one exercise class (or computer laboratory class, or question session) of 45 min. Note that class always starts at 10.15 hour, because it is part of Mastermath.

Intended audience

This is a Master's course which is an elective course for students of the Master's programme Mathematical Sciences of Utrecht University specialising in Scientific Computing. (As of September 1, 2011, Scientific Computing is a specialisation of this 2-year programme.) The course is recommended for students in Mathematics or Computer Science who are interested in scientific computing and would like to obtain a first `hands-on experience' in parallel programming. It is also highly suitable for the Honours master programme in Theoretical Physics and Mathematical Sciences and for students interested in Computational Physics. If you consider taking this course as part of a Bachelor's degree, please contact the teacher first about the prerequisites. The course is part of the Dutch National Master Mastermath and hence is open for all interested master students in Mathematics in the Netherlands. Registration for everybody is through Mastermath.

Course aim

Students will learn how to design a parallel algorithm for a problem from the area of scientific computing and how to write a parallel program that solves the problem.

Credits

You get 8 ECTS credit points.

Contents

Today, parallel computers are appearing on our desktops. The advent of dual-core and quad-core computers and the expected increase in the number of cores in the coming years, inevitably will cause a major change in our approach to software, such as the software we use in scientific computations. Parallel computers drastically increase our computational capabilities and thus enable us to model more realistically in many application areas.

To make efficient use of parallel computers, it is necessary to reorganise the structure of our computational methods. In particular, attention must be paid to the division of work among the different processors solving a problem in parallel and to the communication between them. Suitable parallel algorithms and systems software are needed to realise the capabilities of parallel computers.

We will discuss extensively the most recent developments in the area of parallel computers, ranging from multi-core desktop PCs, to clusters of PCs connected by switching devices, to massively parallel computers with distributed memory such as our national supercomputer Huygens at SARA in Amsterdam.

The following subjects will be treated:

Language

The course will be given in English. All reading material will be in English.

Prerequisites

Introductory course in linear algebra. Some knowledge of algorithms and programming is helpful. The laboratory classes will use the programming language C. If you don't know C, this may be an opportunity to learn it. You can also use C++, if you already know that language. For those who know Java, and would prefer to use this language, use can be made of the recently released MulticoreBSP software developed by Albert-Jan Yzelman under the supervision of Rob Bisseling. This sofware runs on shared-memory multicore PCs, but not on distributed-memory machines such as Huygens.

Examination

The examination is in the form of two assignments during the course, each with a weight of 45%, and a small written examination on the first chapter of the book with a weight of 10%. The small examination must be passed with grade 6 (out of 10) or higher. The two assignments are carried out in the exercise/laboratory class and at home. A written report must be returned before the deadline. Students can work in pairs (but not in larger teams) on the computer programs, but the reports must be written individually. Programs produced in pairs must be clearly identified as such (Only one program per pair needs to be given to me.) Students from Utrecht must submit in hardcopy paper format, and you may use any word processor you like, and even handwriting, if it is readable. But LaTeX is preferred! Students from outside Utrecht University may also submit reports electronically in PDF format by email, but hardcopy is still preferred.

Literature

We closely follow the book Parallel Scientific Computation: A Structured Approach using BSP and MPI (PSC), by Rob H. Bisseling, Oxford University Press, March 2004. ISBN 0-19-852939-2. Please note that the book is now available by the printing-on-demand system of Oxford University Press, at the book's OUP website. Furthermore, our student union Aeskwadraat may still have a limited number of copies.

The first chapter of the book is freely available from the publisher's website, see "Sample material".

In addition, all my transparancies and WWW links to background material are provided through this course page. (Pointers to other relevant links are welcome!)

Transparancies

LaTeX sources (in Prosper) and PDFs of all my transparencies (61 Mbyte in gzipped tar format). The sources may be of help to other teachers who want to teach from my book.

Last update of the files: September 19, 2007. The transparancies are now complete and they cover every section of the book.

I am working on a new version of the slides using the Beamer macros of Latex. They will be made available sometime in the future.

You can also obtain the PDFs separately. Each PDF file represents one lecture of about 45 minutes.

Chapter 1: sections 2 3 4 5-7
Chapter 2: sections 1-2 3 4 5-6
Chapter 3: sections 1-2 3 4 5 6 7
Chapter 4: sections 1 2 3 4 5 6 7 8 9 10
Appendix C: Programming in BSP style using MPI, 1 (MPI-1) 2 (MPI-2).

Further reading

Some weekly summaries and links

Wednesday September 7, 2011. Room 611 Mathematics building

The Bulk Synchronous Parallel model

(PSC section 1.2) What is a parallel computer? Why do we compute in parallel? The Bulk Synchronous Parallel (BSP) model by Valiant comprises an abstract machine architecture, a framework for developing algorithms, and a cost function for analysing the run time of algorithms. The BSP architecture is a set of processor-memory pairs connected by a black box communication network. BSP algorithms consist of computation supersteps and communication supersteps. Communication is costed as h-relations. The BSP cost of an algorithm is expressed in machine parameters p, g, l. The computing rate is r flop/s. Motivation of the cost model: bottlenecks at the entrance and exit of the communication network.

Parallel Inner Product Computation

(PSC section 1.3) My First Parallel Algorithm: the computation of the inner product of two vectors. Possible distributions: cyclic and block. Data distribution leads naturally to work distribution. Communication is needed to add local inner products into a global inner product. Single Program, Multiple Data (SPMD) approach: we don't want to write 400 different programs for a machine with 400 processors. One-sided communication is a great invention (in parallel computing, at least;).

Exercise class
Exercise 1.2 from the PSC book.
Interesting links

Wednesday September 14, 2011. Room 611 Mathematics building

BSPlib, the Bulk Synchronous Parallel library

(PSC section 1.4)

Benchmarking

(PSC sections 1.5-1.7)

Interesting links:

Exercise class
Answers to Exercise 1.2 of the previous week.

Wednesday September 21, 2011. Room 611 Mathematics building

Sequential LU Decomposition

(PSC sections 2.1-2.2)

Computer Laboratory in Room Minnaert 018 (from 11.15-13.00): starting with BSPlib. First assignment.
This is your first hands-on session with BSPlib. You may have to set up your accounts. Download the latest version of BSPedupack , my package of educational programs that teaches how to use BSP. Solve Exercise 1.3: try to run the benchmark, exploring your parallel environment: your own laptop and Huygens. Change puts into gets. Starting Exercise 1.7 from PSC. Design a parallel algorithm for prime number generation. Write your first parallel program. This program should generate all prime numbers below 1,000,000 in parallel. Choosing the right distribution is the name of the game. If you own a laptop, bring it to class, so we can install/check your software

Hand in an integrated report on these two exercises before the deadline, Wednesday November 9, 10.15 hour. No extensions will be given.

Interesting links:

Wednesday September 28, 2011. Room 611 Mathematics building

Parallel LU Decomposition

(PSC section 2.3)

Two-Phase Broadcasting

(PSC section 2.4)

Exercise class

Theoretical exercise, Exercise 2.1: find a non-Cartesian distribution for a 7x7 matrix and 9 processors which is provably optimal for load balance. Ignore the communication cost.

Continue with Exercise 1.7 from PSC. Discuss your parallel algorithm with me.

Wednesday October 5, 2011. Room 611 Mathematics building

Please note: we do have class this week!
Experiments with bsplu

(PSC sections 2.5-2.6)

Sequential Recursive Fast Fourier Transform (FFT)

(PSC sections 3.1-3.2) A wonderful algorithm.

Exercise class

Answer previous exercise.

Continue with Exercise 1.7 from PSC. Discuss your parallel algorithm with me.

Wednesday October 12, 2011. Room 611 Mathematics building

Sequential Nonrecursive FFT

(PSC section 3.3)

Small examination
Test of 60 minutes on Chapter 1. It is a closed-book examination! This exam is from 11.15-12.15 hour
Parallel Fast Fourier Transform

(PSC section 3.4)

Wednesday October 19, 2011. Room 611 Mathematics building

Program bspfft

(PSC section 3.6)

Computer Laboratory in Room Minnaert 016 (from 11.15-13.00)
Continue with solving Exercise 1.7: write a parallel program for generating prime numbers. Run it on Huygens. Get familiar with the interactive and batch mode. Interesting links:

Wednesday October 26, 2011. Room 611 Mathematics building

Experimental results for the FFT

(PSC section 3.7)

Sequential sparse matrix-vector multiplication

(PSC section 4.1)

Exercise class

Work on you assignment.

Wednesday November 2, 2011. Room 611 Mathematics building

Sparse matrices and their data structures

(PSC section 4.2)

Parallel sparse matrix-vector multiplication

(PSC section 4.3)

Exercise class

Answers of the exam. Finish your assignment.

Wednesday November 9, 2011. Room 611 Mathematics building

Today is the deadline of the first assignment. Hand it in at the start of the class.

Cartesian matrix distributions

(PSC section 4.4)

Mondriaan sparse matrix distribution

(PSC section 4.5)

Exercise class

Choose a new assignment. Possibilities: solve Exercise 2.5 on Cholesky factorisation; solve Exercise 2.7 on Strassen matrix multiplication; solve Exercise 3.9 on computing π in parallel (the current world record is 5 trillion digits); write a parallel program for counting self-avoiding walks on a 2D or 3D lattice. Requests for different topics will be considered.

Wednesday November 16, 2011. Room 611 Mathematics building

Guest Lecture Peter Arbenz, professor in Computational Science at the ETH Zurich, and co-author of the book Introduction to Parallel Computing A practical guide with examples in C , Oxford University Press, February 2004. Titel of the Talk: "A Fast Parallel Poisson Solver on Irregular Domains Applied to Beam Dynamic Simulations". Slides of the lecture.

Abstract. We discuss the scalable parallel solution of the Poisson equation within a Particle-In-Cell (PIC) code for the simulation of electron beams in particle accelerators of irregular shape. The problem is discretized by Finite Differences. Depending on the treatment of the Dirichlet boundary the resulting system of equations is symmetric or "mildly" nonsymmetric positive definite. In all cases, the system is solved by the preconditioned conjugate gradient algorithm with algebraic multigrid (AMG) preconditioning. We investigate variants of the implementation of AMG that lead to considerable improvements in the execution times. We demonstrate good scalability of the solver on a distributed memory parallel computer with up to 2048 processors. We also compare our iterative solver with an FFT-based solver that is more commonly used for applications in beam dynamics. We will introduce the FFT-based fast Poisson solver and its parallelization.

Papers:

These are available on his Selected Papers page.

The guest lecture is from 10.15-12.00 with 15 min break, and is followed by an informal discussion of the students with the guest lecturer a question and answers session on the broad topic of "Parallel computing, computational science, and beyond".

Wednesday November 23, 2011. Room 611 Mathematics building

Random sparse matrices

(PSC section 4.7)

Laplacian matrices

(PSC section 4.8)

Second chance for small examination
Repeat test of 60 minutes on Chapter 1. Only if you failed the first examination or want to improve your grade. (The highest grade of the two counts.) The exam is held from 12.00 hour to 13.00 hour.

Wednesday November 30, 2011. Room 611 Mathematics building

Lecture has been cancelled due to circumstances.

Wednesday December 7, 2011. Room 611 Mathematics building.

Program bspmv

(PSC section 4.9)

Experimental results on a beowulf cluster

(PSC section 4.10)

Exercise class

Solution to Exercise 4.2, finding the optimal distribution over 8 processors for a 12 by 12 grid. Discussion of second assignment.

Interesting links:

Wednesday December 14, 2011. Room 611 Mathematics building.

Message Passing Interface (MPI-1)

(PSC sections C.1, C.2.1-C.2.4)

Message Passing Interface (MPI-2)

(PSC Section C2.5, C3, C4)

Computer Laboratory in Room Minnaert 016 from 12.00-13.00 hour.

Deadline second assignment January 16, 2012, 12.00 hour.

Hand in a report before the deadline.

Other courses that teach the BSP model (among others)


Last update of this page: December 6, 2011
Back to homepage of Rob Bisseling