Course Notes Parallel Algorithms (WISM 459), 2014/2015
Parallel Algorithms (WISM 459), 2014/2015
Final assignement has been graded (Feb. 11, 2015, 10.00 hour)
Time and place
Every Wednesday from 10.15-13.00 hour at
Utrecht University, campus De Uithof.
Location of lectures: room
079 in the Buys Ballot building.
First lecture: September 10, 2014.
Last lecture: December 17, 2014.
No class on November 12.
Each session consists of 2 times 45 min. lectures and one exercise class
(or computer laboratory class, or question session) of 45 min.
Note that class always starts at 10.15 hour,
because it is part of Mastermath.
On September 17 and November 5, we will start with a lecture in BBG079 and finish
with a computer laboratory in Minnaert 205.
This is a Master's course which is an elective course for students
of the Master's programme Mathematical Sciences
of Utrecht University, part of the Scientific Computing specialisation.
The course is recommended for students in Mathematics or Computer Science
who are interested in scientific computing
and would like to obtain a first `hands-on experience'
in parallel programming.
It is also highly suitable for the Honours master
programme in Theoretical Physics and Mathematical Sciences
and for students interested in Computational Physics.
If you consider taking this course as part of a Bachelor's degree,
please contact the teacher first about the prerequisites.
The course is part of the Dutch National Master
hence is open for all interested master students in Mathematics
in the Netherlands.
Registration for everybody is through Mastermath.
Students will learn how to design a parallel algorithm for a
problem from the area of scientific computing
and how to write a parallel program that
solves the problem.
After completion of the course, the student is able to
design a parallel algorithm for a problem from the area of scientific computing;
analyse the cost of the algorithm in terms of computing time, communication time, and synchronisation time;
- write a parallel program based on an algorithm that solves the problem;
- write a report on the algorithm, its analysis, and numerical experiments performed, in a concise and readable form.
You get 8 ECTS credit points.
Today, parallel computers are appearing on our desktops.
The advent of dual-core and quad-core computers and the expected increase
in the number of cores in the coming years, inevitably will cause a major change
in our approach to software, such as the software we use in scientific computations.
Parallel computers drastically increase our computational capabilities
and thus enable us to model more realistically in many application areas.
To make efficient use of parallel computers, it is necessary to reorganise
the structure of our computational methods.
In particular, attention must be paid to the division of work
among the different processors solving a problem in parallel
and to the communication between them. Suitable parallel algorithms and systems
software are needed to realise the capabilities of parallel computers.
We will discuss extensively the most recent developments in the area
of parallel computers, ranging from multi-core desktop PCs,
to clusters of PCs connected by switching devices,
to massively parallel computers with distributed memory
such as our national supercomputer Cartesius at SurfSARA in Amsterdam.
The following subjects will be treated:
- Types of existing parallel computers
- Principles of parallel computation: distributing the work evenly
and avoiding communication
- The Bulk Synchronous Parallel (BSP) model as an idealised model
of a parallel computer
- Use of BSPlib software as the basis for architecture-independent
Parallel algorithms for the following problems:
prime number sieving, LU decomposition,
Fast Fourier Transform,
sparse matrix-vector multiplication, iterative solution of
linear systems, graph matching.
Analysis of the computation, communication, and synchronisation
time requirements of these algorithms by the BSP model.
The course will be given in English.
All reading material will be in English.
Introductory course in linear algebra.
Some knowledge of algorithms and programming is helpful.
The laboratory classes will use the programming language C.
If you don't know C, this may be an opportunity to learn it.
Some students found the website of
Harvard's CS50 course useful to quickly catch up on C, or
you may consider the book
Practical C Programming by Steve Oualline.
You can also use C++, if you already know that language.
We will make use of the recently released
MulticoreBSP for C software developed by Albert-Jan Yzelman.
This sofware runs on shared-memory multicore PCs, and you can also run
your program without any change on distributed-memory machines
such as Cartesius.
The examination is in the form of two assignments during the course,
each with a weight of 45%, and three homework exercises
with a total weight of 10%.
The two assignments are carried out in the
exercise/laboratory class and at home.
A written report must be returned before the deadline.
For the two assignments, students can work individually or in pairs (but not in larger teams)
on the computer programs and can hand in a single report,
provided each did a fair share of the work
and can account for that.
Homework must be made individually.
If needed, you will have to explain your answers to the homework exercises.
Students from Utrecht must submit reports for the assignments in hardcopy paper format.
Students from outside Utrecht University may
also submit reports electronically in PDF format by email,
but hardcopy is still preferred.
All students must use LaTeX for the assignments; handwritten is OK for the
We closely follow the book
Scientific Computation: A Structured Approach using BSP and MPI (PSC),
by Rob H. Bisseling,
Oxford University Press,
March 2004. ISBN 0-19-852939-2.
Please note that the book is now
available by the printing-on-demand system of Oxford University Press,
at the book's OUP website.
Furthermore, our student union Aeskwadraat may still have a limited number of copies.
I am currently working on a second edition,
and some additional material will be given to you
(and will be tested on you!).
The first chapter of the book is freely available from the
see "Sample material".
In addition, all my transparancies and WWW links to
background material are provided through this course page.
(Pointers to other relevant links are welcome!)
LaTeX sources (in Beamer) and PDFs of all my transparencies
(18 Mbyte in gzipped tar format).
The sources may be of help to other teachers
who want to teach from my book.
Students may find them helpful too.
Last update of the files: November 6, 2014.
The transparancies are complete
and they cover every section of the book.
They have been converted to
the Beamer macros in September 2012, and have been made up to date in the process.
I am adding experimental results on recent computer architectures as we go.
The old version (from 2007) in Prosper can be found
You can also obtain the PDFs separately. Each PDF file represents one
lecture of about 45 minutes.
Chapter 1: sections
Chapter 2: sections
Chapter 3: sections
Chapter 4: sections
Programming in BSP style using MPI,
U. Naumann and O. Schenk (eds), Combinatorial Scientific Computing,
CRC Press, 2012.
J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst,
Numerical Linear Algebra for High-Performance Computers,
G. H. Golub and C. F. van Loan, ``Matrix Computations",
Fourth edition, The Johns Hopkins University Press, 2012.
C. F. van Loan, ``Computational Frameworks for the Fast Fourier Transform",
Some weekly summaries and links
Wednesday September 10, 2014. Room 079 Buys Ballot building
The Bulk Synchronous Parallel model
(PSC section 1.2)
What is a parallel computer? Why do we compute in parallel?
The Bulk Synchronous Parallel (BSP) model by Valiant
comprises an abstract machine architecture, a framework for
and a cost function for analysing the run time of algorithms.
The BSP architecture is a set of processor-memory pairs
connected by a black box communication network.
BSP algorithms consist of computation supersteps
and communication supersteps.
Communication is costed as h-relations.
The BSP cost of an algorithm is expressed
in machine parameters p, g, l.
The computing rate is r flop/s.
Motivation of the cost model: bottlenecks at the entrance
and exit of the communication network.
Parallel Inner Product Computation
(PSC section 1.3)
My First Parallel Algorithm: the computation
of the inner product of two vectors.
Possible distributions: cyclic and block.
Data distribution leads naturally to work distribution.
Communication is needed to add local inner products into a global
Single Program, Multiple Data (SPMD) approach:
we don't want to write 400 different programs for a machine with
One-sided communication is a great invention
(in parallel computing, at least;).
In class: Exercise 1.2. Homework: Exercise 1.1
Please hand in exercise 1.1 (and not more), on September 17, 2014.
Wednesday September 17, 2014.
BSPlib, the Bulk Synchronous Parallel library. Room BBG079
(PSC section 1.4)
(PSC sections 1.5-1.7)
Computer Laboratory in room Minnaert 205
(from 12.00-13.00 hour):
starting with BSPlib.
This is your first hands-on session with BSPlib. You may have to set up your accounts.
Download the latest version of
my package of educational programs that
teaches how to use BSP.
Solve Exercise 1.3: try to run the benchmark,
exploring your parallel environment: your own laptop
and Cartesius. Change puts into gets.
If you own a laptop, bring it to class,
so we can install MulticoreBSP for C and check your software
Wednesday September 24, 2014.
Sequential LU Decomposition
(PSC sections 2.1-2.2)
Parallel LU Decomposition
(PSC section 2.3)
Discussion of the homework. Answers to Exercise 1.2.
Starting Exercise 1.7 from PSC. Design a parallel algorithm
for prime number generation.
Write your first parallel
This program should generate all prime numbers below 1,000,000 in parallel.
Choosing the right distribution is the name of the game.
Hand in a report on exercise 1.7 before the deadline,
Wednesday October 15, 10.15 hour.
Note that you can work in pairs.
Wednesday October 1, 2014. Class from 10.15-12.15 hour.
(PSC section 2.4)
Continue work on the prime number sieve. Discuss your approaches with me.
Today's class finishes early, at 12.15 hour.
Wednesday October 8, 2014.
Experiments with bsplu
(PSC sections 2.5-2.6)
Sequential Recursive Fast Fourier Transform (FFT)
(PSC sections 3.1-3.2) A wonderful algorithm.
Continue work on the prime number sieve.
found to be useful if you want to learn some vi editor commands
on the Linux system.
Wednesday October 15, 2014.
Sequential Nonrecursive FFT
(PSC section 3.3)
Parallel Fast Fourier Transform
(PSC section 3.4)
Choose a new assignment. Possibilities:
solve Exercise 2.5 on Cholesky factorisation;
solve Exercise 2.7 on Strassen matrix multiplication;
solve Exercise 2.10 on optimising LU decomposition
(try to get Cartesius higher in the Top500,
improving its LINPACK performance);
write a parallel program for Karp-Sipser maximum-cardinality matching in a general graph;
write a parallel program for counting self-avoiding walks (SAWs) on a 2D or 3D lattice.
for SAWs on the 3D cubic lattice is 36 steps.)
Requests for different topics will be considered.
Please try to discuss the chosen topic with the teacher before mid-November.
Start working on Homework on Chapter 2, Exercise 2.3(a) on optimising the row swap
in LU decomposition.
Hand it in on October 29.
Wednesday October 22, 2014.
(PSC section 3.6)
Please read section 3.5 yourself as a preparation.
Sequential sparse matrix-vector multiplication
(PSC section 4.1)
Work on your new assignment.
Wednesday October 29, 2014
Sparse matrices and their data structures
(PSC section 4.2)
Parallel sparse matrix-vector multiplication
(PSC section 4.3)
Discussion of final project.
Wednesday November 5, 2014.
Cartesian matrix distributions
(PSC section 4.4)
Discussion on solution prime number sieve. Handback graded reports.
Computing laboratory (12.00-13.00 hour) Minnaert 205
Work on your new assignment.
(PSC section 4.5)
New developments in 2D sparse matrix distribution
Movies of matrix partitioning, improvements in quality and speed
by the medium-grain method (Mondriaan 4.0).
Discussion of Homework on Chapter 2. Handback graded homework.
Wednesday November 26, 2014.
(PSC section 4.8)
Program bspmv and bulk synchronous message passing primitives
(PSC section 4.9)
Homework: Exercise 4.2, but with a 10 x 10 grid instead of 12 by 12.
If you use LaTeX, try Tikz to produce a nice figure.
Hand in December 3.
Wired Article on twin primes problem, 22 November 2013.
Twin gap down from 70 million to 600, closing in on 2.
Wednesday December 3, 2014.
Guest lecture Multi-BSP computing: the next step? by Albert-Jan Yzelman, KU Leuven, author of
Over recent years, in preparation of supercomputers that can reach a
performance in the order of several exaflop per second, people have looked for
new programming models. This lecture explores on why these would be required at all,
by briefly re-exploring the original Bulk Synchronous Parallel (BSP) model including its
currently commonly used variants (such as MPI, MapReduce, and Pregel), as
well as by exploring completely different parallelisation methodologies (such as
OpenMP and Cilk Plus).
The founder of BSP, Leslie Valiant, also suggested an alternative paradigm
for parallel computing: the Multi-BSP model. While Multi-BSP is a hierarchical
model, and while hierarchical programming is already found in contemporary
high-performance computing (e.g., MPI plus OpenMP, or MPI plus CUDA),
explicit hierarchical programming has its problems: it is not portable, lacks
transparency, and complicates modular software design.
In this lecture, a first approach to remedy these issues, based on Multi-BSP,
is presented. We illustrate the proposed framework by writing collective
operations in Multi-BSP, and will briefly explore a Multi-BSP version of the fast
Fourier transform as well.
Exploring Multi-BSP by Albert-Jan Yzelman.
Please bring your laptop to class if you possess one.
Wednesday December 10, 2014. Room BBG 017
Sequential graph matching
Parallel graph matching
Discussion of homework on Chapter 4.
Wednesday December 17, 2011.
Message Passing Interface (MPI-1)
(PSC sections C.1, C.2.1-C.2.4)
Message Passing Interface (MPI-2)
(PSC Section C2.5, C3, C4)
Course evaluation. Discuss your final assignment.
Please fill out the
Mastermath digital evaluation form. Use last year's form if needed!
Deadline second assignment January 15, 2015, 17.00 hour.
Hand in a report before the deadline.
Please use the batch system of Cartesius and not the interactive system
for your test runs. The interactive system is only for development.
All ffinal assignement have been graded now (February 11, 10.00 hour). You can
come and get your report back with comments and we will discuss your work.
Other courses that teach the BSP model (among others)
Parallel computing (B-KUL-H03F9A)
at KU Leuven, Belgium, by Albert-Jan Yzelman.
Efficient Parallel Algorithms (CS329)
at University of Warwick, UK,
by Alex Tiskin.
Parallel and Distributed Programming (CS236370)
by Assaf Schuster,
at the Technion, Haifa, Israel.
High Performance and Parallel Computing (22C177)
by Suely Oliveira,
at the University of Iowa.
Parallel computing (in Polish)
by Przemek Stpiczynski at the Maria Curie-Sklodowska University,
Last update of this page: February 11, 2015
Back to homepage of Rob Bisseling