Course Notes Parallel Algorithms (WISM 459), 2018/2019
Parallel Algorithms (WISM 459), 2018/2019
Teacher
Rob Bisseling.
Time and place
Every Wednesday from 10.0012.45 hour at
Utrecht University, campus De Uithof.
Location of lectures has changed to: room BBG017 in the Buys Ballot building.
First lecture: September 12, 2018.
Last lecture: December 19, 2018.
No class on October 3, 2018.
Each session consists of 2 times 45 min. lectures and one exercise class
(or computer laboratory class, or question session) of 45 min.
Intended audience
This is a Master's course which is an elective course for students
of the Master's programme Mathematical Sciences
of Utrecht University, part of the Scientific Computing track and the Applied Mathematics track.
The course is recommended for students in Mathematics or Computer Science
who are interested in scientific computing
and would like to obtain a first `handson experience'
in parallel programming.
It is also highly suitable for the Honours master
programme in Theoretical Physics and Mathematical Sciences
and for students interested in Complex systems or Computational Physics.
The course is part of the Dutch National Master
Mastermath and
hence is open for all interested master students in Mathematics
in the Netherlands.
Registration for everybody is through Mastermath.
Course aim
Students will learn how to design a parallel algorithm for a
problem from the area of scientific computing
and how to write a parallel program that
solves the problem.
Learning goals
After completion of the course, the student is able to

design a parallel algorithm for a problem from the area of scientific computing or big data;

analyse the cost of the algorithm in terms of computing time, communication time, and synchronisation time;
 write a parallel program based on an algorithm that solves the problem;
 write a report on the algorithm, its analysis, and numerical experiments performed, in a concise and readable form,
 present the results in an oral presentation, in a clear manner, highlighting the most important ideas.
Credits
You get 8 ECTS credit points.
Contents
Today, parallel computers are appearing on our desktops. The advent of dualcore, quadcore, and octacore computers and the expected increase in the number of cores in the coming years causes a major change in our approach to software, such as the mathematical software we use in scientific computations and in the emerging area of big data computations.
Parallel computers drastically increase our computational capabilities and thus enable us to model more realistically in many application areas. To make efficient use of parallel computers, it is necessary to reorganise the structure of our computational methods. In particular, attention must be paid to the division of work among the different processors solving a problem in parallel and to the communication between them. Suitable parallel algorithms and systems software are needed to realise the capabilities of parallel computers.
We will discuss extensively the most recent developments in the area of parallel computers, ranging from multicore desktop PCs to clusters of PCs connected by switching devices, to massively parallel computers with distributed memory such as our national supercomputer Cartesius at SURFsara in Amsterdam,
The following subjects will be treated:
 Types of existing parallel computers
 Principles of parallel computation: distributing the work evenly
and avoiding communication
 The Bulk Synchronous Parallel (BSP) model as an idealised model
of a parallel computer
 Use of BSPlib software as the basis for architectureindependent
programs

Parallel algorithms for the following problems: inner product computation, sorting, prime number sieving, LU decomposition, sparse matrixvector multiplication, iterative solution of linear systems, graph matching;
 Analysis of the computation. communication and synchronisation time requirements of these algorithms by the BSP model;
 Handson experience in the laboratory class, using the Cartesius supercomputer.
Language
The course will be given in English.
All reading material will be in English.
Prerequisites
Introductory course in linear algebra. Some knowledge of algorithms. Knowledge of a modern programming language such as C, C++, C#, Python, or Java. Basic knowledge of C is helpful, as we will use this language in class, but we will organise the course in such a way that there will be time to adapt to it for those coming from another language.
There are plenty of books on C;
you may consider the book
Practical C Programming by Steve Oualline.
You can also use C++ in class, if you already know that language.
Software
We will make use of the
MulticoreBSP for C software developed by AlbertJan Yzelman, version 2.0 (May 2018).
This sofware runs on sharedmemory multicore PCs, and you can also run
your program with only minor changes on distributedmemory machines
such as Cartesius.
C++ programmers can use
Zefiros Synclib as a nice alternative.
If you do not possess a laptop, perhaps your project partner does
(you are allowed to work in pairs). If this fails, we will find another solution for you.
You will get access to the national supercomputer Cartesius,
where you can install MulticoreBSP for C on one node,
giving access to at most 24 cores for thin nodes
or you can use BSPonMPI (already installed) or Bulk giving access to thousands of cores.
Examination
The examination is in the form of an initial assignment (20%), a final assignment (40%), a presentation on the final assignment (20%), and a midterm exam (20%). The two assignments are carried out in the exercise/laboratory class and at home.
A written report must be returned for each assignment before the deadline. Students can work individually or in pairs (but not in larger teams) on the computer programs and can hand in a single report, provided each did a fair share of the work and can account for that. Each participant has to give an individual presentation on the chosen topic of the final assignment.
Presentations will be scheduled on 5 and 12 December 2018.
The midterm exam must be passed with a grade of 5.5 (on a scale of 10).
All students should submit reports for the assignments
electronically in PDF format by email,
to the teacher, Rob Bisseling.
There is no further homework.
All students must use LaTeX for the assignments.
The midterm exam can be retaken at the end of the course (in December) by any course participant.
The maximum of the two exam grades counts as the midterm exam grade.
In case of illness, or for another urgent reason, the final presentation can be retaken
on an individual basis in January.
In case of an insufficient final grade for the whole course (<=5), the participant
is entitled to a retake, which means that one assignment can be redone within six weeks
after the final grading of the course.
Literature
We closely follow the book
Parallel
Scientific Computation: A Structured Approach using BSP and MPI (PSC),
by Rob H. Bisseling,
Oxford University Press,
March 2004. ISBN 0198529392.
I am currently working on a second edition (PSC2),
and all material needed will be given to you as handouts
(and it will be tested on you!).
In addition, all my slides and WWW links to
background material are provided through this course page.
(Pointers to other relevant links are welcome!)
Slides
LaTeX sources (in Beamer) and PDFs of all my slides
(18 Mbyte in gzipped tar format).
The sources may be of help to other teachers
who want to teach from my book.
Students may find them helpful too.
Last update of the files: November 6, 2014.
The slides are complete
and they cover every section of the book (first edition).
They have been converted to
the Beamer macros in September 2012, and have been made up to date in the process.
I am adding experimental results on recent computer architectures as we go.
You can also obtain the PDFs separately. Each PDF file represents one
lecture of about 45 minutes.
Chapter 1: sections
2
3
4
57
Chapter 2: sections
12
3
4
56
Chapter 3: sections
12
3
4
5
6
7
Chapter 4: sections
1
2
3
4
5
6
7
8
9
10
Appendix C:
Programming in BSP style using MPI,
1 (MPI1)
2 (MPI2).
Slides for the second edition
We will develop them as the second edition is written.
Betaversions will be provided below as they become available over the coming years,
as separate links.
1.8 (2nd ed.)
1.9 (2nd ed.)
2.5 (2nd ed.)
3.6 (2nd ed.)
4.6 (2nd ed.)
4.10 (2nd ed.)
Further reading

U. Naumann and O. Schenk (eds), Combinatorial Scientific Computing,
CRC Press, 2012.

G. H. Golub and C. F. van Loan, ``Matrix Computations",
Fourth edition, The Johns Hopkins University Press, 2012.
Some weekly summaries and links of last years course
This year we will cover chapters 1, 2, 4 of the first edition and chapter 5 on graph matching
of the second edition.
Wednesday September 12, 2017. Room 611 Hans Freudenthal building
The Bulk Synchronous Parallel model
(PSC section 1.2)
What is a parallel computer? Why do we compute in parallel?
The Bulk Synchronous Parallel (BSP) model by Valiant
comprises an abstract machine architecture, a framework for
developing algorithms,
and a cost function for analysing the run time of algorithms.
The BSP architecture is a set of processormemory pairs
connected by a black box communication network.
BSP algorithms consist of computation supersteps
and communication supersteps.
Communication is costed as hrelations.
The BSP cost of an algorithm is expressed
in machine parameters p, g, l.
The computing rate is r flop/s.
Motivation of the cost model: bottlenecks at the entrance
and exit of the communication network.
Parallel Inner Product Computation
(PSC section 1.3)
My First Parallel Algorithm: the computation
of the inner product of two vectors.
Possible distributions: cyclic and block.
Data distribution leads naturally to work distribution.
Communication is needed to add local inner products into a global
inner product.
Single Program, Multiple Data (SPMD) approach:
we don't want to write 400 different programs for a machine with
400 processors.
Onesided communication is a great invention
(in parallel computing, at least;).
Exercise class
In class: Exercise 1.2. No need to hand in.
Interesting links
 Top 500 of supercomputers
in the world.
 SURFsara
in Amsterdam, home of the Cartesius national supercomputer
which we will use in our laboratory class and homework. It is a Bullx machine.
Cartesius currently has 47,776 cores organised in thin and fat nodes and 132 GPUs.
Its theoretical peak performance is 1843 Tflop/s.
Wednesday September 19, 2018. Room BBG017 in the Buys Ballot building
BSPlib, the Bulk Synchronous Parallel library.
(PSC section 1.4)
Benchmarking
(PSC sections 1.51.7)
Interesting links:

BSPedupack version 2.0beta05 growing collection of the BSPlib programs of the second edition of the book. Released September 3, 2018.
See also the explanation on the
software
page of Rob Bisseling

MulticoreBSP software for C developed by AlbertJan Yzelman.
Version 2.0 released May 20, 2018.
It is recommended you use this version on your own PC running Linux or MacOS.
Please follow the excellent
Quickstart guide step by step.
For questions on Windows see the
MulticoreBSP Forum.

BSPonMPI
by Wijnand J. Suijlen. Implementation of BSPlib on top of MPI,
enabling use of BSPlib on almost every parallel computer.
Please note, the latest version is 0.4 and can be downloaded from
the website of the
BSP Worldwide organisation.
It has been installed already at Cartesius.
It is loaded by the command 'module load bsponmpi'.

Zefiros Synclib
by Mick van Duijn and Paul Visscher
Computer Laboratory BYOD
(from 12.0012.45 hour):
starting with BSPlib.
This is your first handson session with BSPlib.
Download the latest version of
BSPedupack,
my package of educational programs that
teaches how to use BSP.
Note that a betaversion for the second edition is available.
Preferably use this version.
Solve Exercise 1.3: try to run the benchmark,
exploring your parallel environment: your own laptop
and Cartesius. Change puts into gets.
Also modify your program to send larger messages.
How does this influence g?
BYOD: If you own a laptop, bring it to class,
so we can install MulticoreBSP for C, SyncLib or Bulk and check your software
Wednesday September 27, 2017.
Parallel sample sort
New material, presenting the BSP approach to sorting by regular sampling.
Will become section 1.8 in PSC2.
Program bspsort and bulk synchronous message passing
Will become section 1.9 in PSC2.
12.0012.45 Exercise class: first assigment
Start working on the first assignment, Exercise 1.7 from PSC2.
Design a parallel algorithm
for prime number generation.
Write your first parallel
program.
This program should generate all prime numbers below 10,000,000 in parallel.
Choosing the right distribution is the name of the game.
Hand in a report on Exercise 1.7 (in PSC2) presenting your algorithm and the experimental results,
before the deadline Wednesday October 17, 10.00 hour.
Note that you can work in pairs.
Interesting links:
Wednesday October 3, 2018.
No class
Because of the Woudschoten
conference on scientific computing
Wednesday October 10, 2018.
Sequential LU Decomposition
(PSC sections 2.12.2)
Parallel LU Decomposition
(PSC section 2.3)
Exercise class
Continue work on the prime number sieve. Discuss your approaches with me.
Wednesday October 17, 2018.
TwoPhase Broadcasting
(PSC section 2.4)
Highperformance LU decomposition
(PSC2 section 2.5)
Question hour on exam material
Exam material: the handouts of Chapters 1 and 2.
Wednesday October 24, 2018
11.0012.45 Midterm exam on Chapters 1 and 2. Room 042 in the Ruppert building.
Exam material: only the material of the handouts. Note the different location.
Interesting links, previous exams from the archive. Exams from 2012, 2013 only on Chapter 1 of the material (60 minutes).
Exams 2017 on Chapter 1, 3 (90 minutes).
Wednesday October 31, 2018. Guest lecture by JanWillem Buurlage (CWI). Room BBG017 in the Buys Ballot building
10.0010.45 Bulk: A Modern C++ Interface for BulkSynchronous Parallel Programs
Abstract: The programming landscape today is vastly different from when the BSPlib interface was conceived. Modern programming languages have useful features compared to e.g. the lowlevel C language, but most are not yet suited for high performance computing applications. Bulk is a novel programming interface and library for BSP programs, that leverages modern C++ to strike a balance between performance, expressability and safety. In this talk, I will introduce this library, and show the differences between Bulk and BSPlib by discussing the code for common tasks in both interfaces.
Paper (accepted mansucript):
Bulk: a Modern C++ Interface for BulkSynchronous Parallel Programs by JanWillem Buurlage, Tom Bannink, and Rob H. Bisseling,
in Proceedings EuroPar 2018, Lecture Notes in Computer Science, Vol. 11014, Springer, 2018, pp. 519532.
Published version.
Today's slides on Bulk by JanWillem Buurlage.
11.0011.45 Sequential sparse matrixvector multiplication
(PSC section 4.1)
Introduction to application of sparse matrices. Sparse matrices and computerised tomography.
Today's slides on sparse matrixvector multiplication
by JanWillem Buurlage.
12.0012.45 Exercise class
Choose a new assignment. Possibilities:
You can parallelise your own problem, or choose one of the exercises from the book PSC/PSC2.
For example:
 solve Exercise 1.5 (new) on parallel sorting algorithms;
 solve Exercise 2.7 (new) on boolean matrix multiplication;
 solve Exercise 4.6 (new) on connected components in a 3D image;
 solve Exercise 4.8 (new) on optimal hypergraph partitioning;
 solve Exercise 4.10 (new) on sparse matrixvector multiplication in C++ using Bulk;
 solve Exercise 5.4 (new) on the maximum independent set problem in graphs;
 solve Exercise 5.5 (new) write a parallel program for solving the nqueens problem
on a chessboard;
 solve Exercise 5.6 (new) write a parallel program for counting selfavoiding walks (SAWs)
on a graph or a 2D or 3D lattice;
 solve Exercise 5.7 (new) write a parallel program for counting triangles in a social network graph;
 solve Exercise 5.8 (new) write a parallel program for computing betweenness centrality on a graph.
 solve Exercise 5.9 (new) on optimal bipartite graph matching.
Please try to discuss the chosen topic with the teacher before November 14 and register
the topic. You can work in pairs, but have to give individual presentations
on the topic. To ensure diversity of the presentations, all topics must be different.
First come, first serve, so register once you know you want a topic.
Wednesday November 7, 2018
Sparse matrices and their data structures
(PSC section 4.2)
Parallel sparse matrixvector multiplication
(PSC section 4.3)
Exercise class
Discussion of first assignment and midterm exam.
Wednesday November 14, 2018
11.4512.45 Cartesian matrix distributions
(PSC section 4.4)
Mondriaan sparse matrix distribution
(PSC section 4.5)
Exercise class
Work on final project.
Wednesday November 21, 2018
Mediumgrain method for sparse matrix distribution
(New material, will become section 4.6 of PSC2).
Parallel algorithm for hybridBSP
(New material, will become section 4.10 of PSC2).
Exercise class
Work on your final project and discuss it with me.
Interesting links:
Wednesday November 28, 2018.
Parallel graph matching
(New material, PSC2 Chapter 5)
Exercise class
Work on final project
Interesting links:
Wednesday December 5, 2018.
Presentations final project
Wednesday December 12, 2018.
More presentations final project
Also: course evaluation.
Wednesday December 19, 2018.
11.0012.45 Retake midterm exam Room 409 Hans Freudenthal building
Please fill out the
Mastermath digital evaluation form.
Deadline second assignment Wednesday January 16, 2019, 12.00 hour.
Hand in a report (by email, in printable PDF) before the deadline.
Please use the batch system of Cartesius and not the interactive system
for your test runs. The interactive system is only for development.
Running your parallel program on hundreds of processor cores
will give you a bonus.
Frequently asked questions
Other courses that teach the BSP model (among others)
Last update of this page: December 19, 2018
Back to homepage of Rob Bisseling