Course Notes Parallel Algorithms (WISM 459), 2017/2018

Parallel Algorithms (WISM 459), 2017/2018


Rob Bisseling. Teaching assistant: Mick van Duijn ( )

Time and place

Every Wednesday from 10.00-12.45 hour at Utrecht University, campus De Uithof. Location of lectures: room 611 in the Hans Freudenthal building. (Please note the new location!) First lecture: September 13, 2017. Last lecture: December 20, 2017. Each session consists of 2 times 45 min. lectures and one exercise class (or computer laboratory class, or question session) of 45 min.

Intended audience

This is a Master's course which is an elective course for students of the Master's programme Mathematical Sciences of Utrecht University, part of the Scientific Computing specialisation. The course is recommended for students in Mathematics or Computer Science who are interested in scientific computing and would like to obtain a first `hands-on experience' in parallel programming. It is also highly suitable for the Honours master programme in Theoretical Physics and Mathematical Sciences and for students interested in Complex systems or Computational Physics. The course is part of the Dutch National Master Mastermath and hence is open for all interested master students in Mathematics in the Netherlands. Registration for everybody is through Mastermath.

Course aim

Students will learn how to design a parallel algorithm for a problem from the area of scientific computing and how to write a parallel program that solves the problem.

Learning goals

After completion of the course, the student is able to


You get 8 ECTS credit points.


Today, parallel computers are appearing on our desktops. The advent of dualcore, quadcore, and octacore computers and the expected increase in the number of cores in the coming years causes a major change in our approach to software, such as the mathematical software we use in scientific computations and in the emerging area of big data computations.

Parallel computers drastically increase our computational capabilities and thus enable us to model more realistically in many application areas. To make efficient use of parallel computers, it is necessary to reorganize the structure of our computational methods. In particular, attention must be paid to the division of work among the different processors solving a problem in parallel and to the communication between them. Suitable parallel algorithms and systems software are needed to realize the capabilities of parallel computers.

We will discuss extensively the most recent developments in the area of parallel computers, ranging from multicore desktop PCs to clusters of PCs connected by switching devices, to massively parallel computers with distributed memory such as our national supercomputer Cartesius at SURFsara in Amsterdam,

The following subjects will be treated:


The course will be given in English. All reading material will be in English.


Introductory course in linear algebra. Some knowledge of algorithms. Good knowledge of a modern programming language such as C, C++, C#, Python, or Java. Basic knowledge of C is helpful, as we will use this language in class, but we will organise the course in such a way that there will be time to adapt to it for those coming from another language. There are plenty of books on C; you may consider the book Practical C Programming by Steve Oualline. You can also use C++ in class, if you already know that language.


We will make use of the recently released MulticoreBSP for C software developed by Albert-Jan Yzelman. This sofware runs on shared-memory multicore PCs, and you can also run your program with only minor changes on distributed-memory machines such as Cartesius.


Recommended: Bring your own device! In certain weeks, marked below by BYOD, it is helpful to bring your own laptop, if you possess one, for use during computer laboratory sessions. Please install the MulticoreBSP library. On Macs and Linux computers this is straightforward. On Windows machines you need a UNIX emulator which runs Pthreads, and it is more difficult to get the software running. If you do not possess a laptop, perhaps your project partner does (you are allowed to work in pairs). If this fails, we will find another solution for you.

You will get access to the national supercomputer Cartesius, where you can install MulticoreBSP for C on one node, giving access to at most 24 cores for thin nodes or you can use BSPonMPI (already installed) giving access to thousands of cores.


The examination is in the form of an initial assignment (20%), a final assignment (40%), a presentation on the final assignment (20%), and a midterm exam (20%). The two assignments are carried out in the exercise/laboratory class and at home. A written report must be returned for each assignment before the deadline. Students can work individually or in pairs (but not in larger teams) on the computer programs and can hand in a single report, provided each did a fair share of the work and can account for that. Each participant has to give an individual presentation on the chosen topic of the final assignment. Presentations will be scheduled on 6, 13, and 20 December 2017. The midterm exam must be passed with a grade of 5.5 (on a scale of 10). All students should submit reports for the assignments electronically in PDF format by email, to the teacher, Rob Bisseling. There is no further homework. All students must use LaTeX for the assignments. For the first assignment, you may submit your parallel program to an automated testing system, Domjudge. Utrecht students (or those who had an Utrecht F-number in the past) can use their Solis-id, others should create a new account. This system helps in testing and also presents a scoreboard. Since it may take some effort to use this system, this is optional and not obligatory.


We closely follow the book Parallel Scientific Computation: A Structured Approach using BSP and MPI (PSC), by Rob H. Bisseling, Oxford University Press, March 2004. ISBN 0-19-852939-2. I am currently working on a second edition (PSC2), and all material needed will be given to you as handouts (and it will be tested on you!).

In addition, all my slides and WWW links to background material are provided through this course page. (Pointers to other relevant links are welcome!)


LaTeX sources (in Beamer) and PDFs of all my slides (18 Mbyte in gzipped tar format). The sources may be of help to other teachers who want to teach from my book. Students may find them helpful too.

Last update of the files: November 6, 2014. The slides are complete and they cover every section of the book (first edition). They have been converted to the Beamer macros in September 2012, and have been made up to date in the process. I am adding experimental results on recent computer architectures as we go.

You can also obtain the PDFs separately. Each PDF file represents one lecture of about 45 minutes.

Chapter 1: sections 2 3 4 5-7
Chapter 2: sections 1-2 3 4 5-6
Chapter 3: sections 1-2 3 4 5 6 7
Chapter 4: sections 1 2 3 4 5 6 7 8 9 10
Appendix C: Programming in BSP style using MPI, 1 (MPI-1) 2 (MPI-2).

Slides for the second edition

We will develop them as the second edition is written. Beta-versions will be provided below as they become available over the coming years, as separate links. 1.8 (2nd ed.) 1.9 (2nd ed.)

Further reading

Some weekly summaries and links of last years course

This year we will cover chapters 1, 3, 4 of the first edition and chapter 5 on graph matching of the second edition.

Wednesday September 13, 2017. Room 611 Hans Freudenthal building

The Bulk Synchronous Parallel model

(PSC section 1.2) What is a parallel computer? Why do we compute in parallel? The Bulk Synchronous Parallel (BSP) model by Valiant comprises an abstract machine architecture, a framework for developing algorithms, and a cost function for analysing the run time of algorithms. The BSP architecture is a set of processor-memory pairs connected by a black box communication network. BSP algorithms consist of computation supersteps and communication supersteps. Communication is costed as h-relations. The BSP cost of an algorithm is expressed in machine parameters p, g, l. The computing rate is r flop/s. Motivation of the cost model: bottlenecks at the entrance and exit of the communication network.

Parallel Inner Product Computation

(PSC section 1.3) My First Parallel Algorithm: the computation of the inner product of two vectors. Possible distributions: cyclic and block. Data distribution leads naturally to work distribution. Communication is needed to add local inner products into a global inner product. Single Program, Multiple Data (SPMD) approach: we don't want to write 400 different programs for a machine with 400 processors. One-sided communication is a great invention (in parallel computing, at least;).

Exercise class

In class: Exercise 1.2. No need to hand in.

Interesting links

Wednesday September 20, 2017.

BSPlib, the Bulk Synchronous Parallel library.

(PSC section 1.4)


(PSC sections 1.5-1.7)

Interesting links:

Computer Laboratory BYOD (from 12.00-12.45 hour): starting with BSPlib.
This is your first hands-on session with BSPlib. Download the latest version of BSPedupack, my package of educational programs that teaches how to use BSP. Note that a beta-version for the second edition is available. Preferably use this version. Solve Exercise 1.3: try to run the benchmark, exploring your parallel environment: your own laptop and Cartesius. Change puts into gets. Also modify your program to send larger messages. How does this influence g? BYOD: If you own a laptop, bring it to class, so we can install MulticoreBSP for C and check your software

Wednesday September 27, 2017.

Parallel sample sort
New material, presenting the BSP approach to sorting by regular sampling. Will become section 1.8 in PSC2.
Program bspsort and bulk synchronous message passing
Will become section 1.9 in PSC2.
12.00-12.45 Guest lecture: Modern BSP
Jan-Willem Buurlage (CWI) will give an overview of a new bulk synchronous parallel library, Bulk, which he is developing together with Tom Bannink (CWI). The library is based on the latest version of C++, C++17. If you are already an experienced C++ programmer, you may want to use this very elegant library instead of BSPlib.

Abstract of the lecture Modern C++ provides features that make it feasible to write efficient, yet safe code. In the first half of this talk, we present Bulk; a BSP library that makes use of these features to simplify the implementation of correct parallel programs using fewer lines of code. In the second half, we will look at Epiphany BSP; a BSPlib implementation for a special mini supercomputer called the Parallella.

At home: start working on the first assignment, Exercise 1.8 from PSC2. You can run it on the testing environment Domjudge (not obligatory, but well worth the effort) and obtain a score on the score board. Design a parallel algorithm for prime number generation. Write your first parallel program. This program should generate all prime numbers below 1,000,000 in parallel. Choosing the right distribution is the name of the game.

Hand in a report on Exercise 1.8 (in PSC2) presenting your algorithm and the experimental results, before the deadline Wednesday October 18, 10.00 hour. Note that you can work in pairs.

Interesting links:

Wednesday October 4, 2017.

Fast Fourier Transform

(PSC sections 3.1-3.2)

Nonrecursive Fast Fourier Transform

(PSC section 3.3)

Exercise class
Continue work on the prime number sieve.

Wednesday October 11, 2017.

Parallel Fast Fourier Transform

(PSC section 3.4)

Weights of the FFT

(PSC section 3.5)

Wednesday October 18, 2017.

Program bspfft

(PSC section 3.6)

Experiments with bspfft

(PSC section 3.7)

Exercise class

Choose a new assignment. Possibilities: You can parallelise your own problem, or choose one of the exercises from the book PSC/PSC2. For example: solve Exercise 1.5 (new) on parallel sorting algorithms; solve Exercise 2.5 on Cholesky factorisation; develop a communication-avoiding LU decomposition with multiple-rank updates; write a parallel program for maximum-cardinality matching in a bipartite graph; solve Exercise 5.2 (new) write a parallel program for solving the n-queens problem on a chessboard; solve Exercise 5.3 (new) write a parallel program for counting self-avoiding walks (SAWs) on a graph or a 2D or 3D lattice; solve Exercise 5.4 (new) write a parallel program for counting triangles in a social network graph; solve Exercise 5.5 (new) write a parallel program for computing betweenness centrality on a graph. Please try to discuss the chosen topic with the teacher before November 8 and register the topic. You can work in pairs, but have to give individual presentations on the topic. To ensure diversity of the presentations, all topics must be different. First come, first serve, so register once you know you want a topic.

Wednesday October 25, 2017

10.00-11.30 Midterm exam on Chapters 1 and 3. Room Gamma of Educatorium building
12.00-12.45 Sequential sparse matrix-vector multiplication

(PSC section 4.1)

Interesting links, previous exams from the archive. Please note that the exams only cover part (Chapter 1) of the material, but not Chapter 3. Also, they last 60 minutes instead of 90.

Wednesday November 1, 2017

Sparse matrices and their data structures

(PSC section 4.2)

Parallel sparse matrix-vector multiplication

(PSC section 4.3)

Exercise class
Discussion of final project.

Wednesday November 8, 2017

11.45-12.45 Cartesian matrix distributions

(PSC section 4.4)

Mondriaan sparse matrix distribution

(PSC section 4.5)

Exercise class
Discuss your final project with me.

Wednesday November 15, 2017

Medium-grain method for sparse matrix distribution

(New material, will become section 4.6 of PSC2)

Exercise class
Work on your final project.

Interesting links:

Wednesday November 22, 2017.

Parallel graph matching
(New material, PSC2 Chapter 5)
Exercise class
Work on final project

Interesting links:

Wednesday November 29, 2017

Guest Lecture Ruud van der Pas

Title: Shared Memory Parallel Performance To The Extreme

Speaker: Ruud van der Pas, Distinguished Engineer, Oracle, Santa Clara, CA, USA

Abstract: In this talk, we explore the boundaries of shared memory parallelization. Through several real-world case studies we show what it takes to make Tera Byte sized memory problems scale when using up to 2,048 threads in a single SPARC based system.

As will be explained, and demonstrated, to achieve this kind of scalability, memory driven optimizations are crucial, but processor pipeline operations like atomic operations, cannot be ignored either.

Wednesday December 6, 2017.

Presentations final project
Exercise class

Wednesday December 13, 2017.

More presentations final project

Wednesday December 20, 2017.

10.00-11.30 Retake midterm exam Room 135 Ruppert building
12.00-12.45 More presentations final project
Also: course evaluation.

Please fill out the Mastermath digital evaluation form.

Deadline second assignment Monday January 15, 2018, 12.00 hour.

Hand in a report before the deadline. Please use the batch system of Cartesius and not the interactive system for your test runs. The interactive system is only for development. Running your parallel program on hundreds of processor cores will give you a bonus.

Frequently asked questions

Other courses that teach the BSP model (among others)

Last update of this page: October 16, 2017
Back to homepage of Rob Bisseling