Course Notes Parallel Algorithms (WISM 459), 2015/2016
Teacher
Rob Bisseling.
Teaching assistant: Abe Wits.
Time and place
Every Wednesday from 10.00 to 12.45 at
Utrecht University, campus De Uithof.
Location of lectures: room
169 in the Buys Ballot building.
First lecture: September 9, 2015.
Last lecture: December 16, 2015.
No class on October 7.
Each session consists of two 45-minute lectures and one 45-minute exercise class
(or computer laboratory class, or question session).
Intended audience
This is a Master's course which is an elective course for students
of the Master's programme Mathematical Sciences
of Utrecht University, part of the Scientific Computing specialisation.
The course is recommended for students in Mathematics or Computer Science
who are interested in scientific computing
and would like to obtain a first `hands-on experience'
in parallel programming.
It is also highly suitable for the Honours master
programme in Theoretical Physics and Mathematical Sciences
and for students interested in Computational Physics.
The course is part of the Dutch national Master's programme
Mastermath and
hence is open to all interested Master's students in Mathematics
in the Netherlands.
Registration for everybody is through Mastermath.
Course aim
Students will learn how to design a parallel algorithm for a
problem from the area of scientific computing
and how to write a parallel program that
solves the problem.
Learning goals
After completion of the course, the student is able to:
- design a parallel algorithm for a problem from the area of scientific computing;
- analyse the cost of the algorithm in terms of computing time, communication time, and synchronisation time;
- write a parallel program based on an algorithm that solves the problem;
- write a report on the algorithm, its analysis, and the numerical experiments performed, in a concise and readable form;
- give a clear oral presentation of the algorithm, its analysis, and its performance.
Credits
You get 8 ECTS credit points.
Contents
Today, parallel computers are appearing on our desktops.
The advent of dual-core and quad-core computers, and the expected increase
in the number of cores in the coming years, will inevitably cause a major change
in our approach to software, such as the software we use in scientific computations.
Parallel computers drastically increase our computational capabilities
and thus enable us to model more realistically in many application areas.
To make efficient use of parallel computers, it is necessary to reorganise
the structure of our computational methods.
In particular, attention must be paid to the division of work
among the different processors solving a problem in parallel
and to the communication between them. Suitable parallel algorithms and systems
software are needed to realise the capabilities of parallel computers.
We will discuss extensively the most recent developments in the area
of parallel computers, ranging from multicore desktop PCs,
to clusters of PCs connected by switching devices,
to massively parallel computers with distributed memory,
such as our national supercomputer Cartesius at SURFsara in Amsterdam.
The following subjects will be treated:
- Types of existing parallel computers
- Principles of parallel computation: distributing the work evenly and avoiding communication
- The Bulk Synchronous Parallel (BSP) model as an idealised model of a parallel computer
- Use of BSPlib software as the basis for architecture-independent programs
- Parallel algorithms for the following problems: prime number sieving, LU decomposition, sorting, sparse matrix-vector multiplication, iterative solution of linear systems, graph matching
- Analysis of the computation, communication, and synchronisation time requirements of these algorithms by the BSP model
Language
The course will be given in English.
All reading material will be in English.
Prerequisites
Introductory course in linear algebra.
Some knowledge of algorithms and programming is helpful.
The laboratory classes will use the programming language C.
If you don't know C, this may be an opportunity to learn it.
A reasonable tutorial for beginners is
Learn C programming,
or
you may consider the book
Practical C Programming by Steve Oualline.
If you already know how to program, but need to learn C, you might want to try
Learn C the Hard Way.
You can also use C++, if you already know that language.
Software
We will make use of the recently released
MulticoreBSP for C software developed by Albert-Jan Yzelman.
This software runs on shared-memory multicore PCs, and you can also run
your program without any change on distributed-memory machines
such as Cartesius.
Hardware
Recommended: Bring your own device! In certain weeks, marked below by BYOD,
it is helpful to bring your own laptop, if you possess one,
for use during the computer laboratory sessions. Please install the MulticoreBSP library.
On Macs and Linux computers this is straightforward. On Windows machines
you need a UNIX emulator that runs Pthreads, and it
is more difficult to get the software running.
If you do not possess a laptop, perhaps your project partner does
(you are allowed to work in pairs). If this fails, we will find another solution for you.
You will get access to the national supercomputer Cartesius.
Examination
The examination is in the form of an initial assignment (30%), a final assignment (40%),
a presentation on the final assignment (15%), and homework (15%).
The two assignments are carried out in the exercise/laboratory class and at home.
A written report must be returned for each assignment before the deadline.
Students can work individually or in pairs (but not in larger teams) on the computer programs
and can hand in a single report, provided each did a fair share of the work and can account for that.
Presentations will be on 9 and 16 December and will be individual.
Homework must be done individually. If needed, you will have to explain your answers to the homework exercises.
Homework will be handed in three times, spread over the semester.
All students from Utrecht should submit their reports electronically in PDF format by email
to the teacher, Rob Bisseling, with a cc to the teaching assistant, Abe Wits.
No hard copy is needed.
All homework, however, must be handed in as hard copy.
All students must use LaTeX for the assignments; handwritten answers are fine for the
homework.
For the first assignment, you will have to submit your parallel program to an automated testing system,
which will become available in early October.
An email with instructions on how to do this has been sent to you.
Literature
We closely follow the book
Parallel Scientific Computation: A Structured Approach using BSP and MPI (PSC),
by Rob H. Bisseling,
Oxford University Press,
March 2004. ISBN 0-19-852939-2.
Please note that the book is now
available through the print-on-demand system of Oxford University Press,
at the book's OUP website.
I am currently working on a second edition,
and some additional material will be given to you
(and it will be tested on you!).
The first chapter of the book is freely available from the
publisher's website,
see "Sample material".
In addition, all my slides and WWW links to
background material are provided through this course page.
(Pointers to other relevant links are welcome!)
Slides
LaTeX sources (in Beamer) and PDFs of all my slides
(18 Mbyte in gzipped tar format).
The sources may be of help to other teachers
who want to teach from my book.
Students may find them helpful too.
Last update of the files: November 6, 2014.
The slides are complete
and they cover every section of the book.
They were converted to
the Beamer macros in September 2012 and brought up to date in the process.
I am adding experimental results on recent computer architectures as we go.
The old version (from 2007) in Prosper can be found
here.
You can also obtain the PDFs separately. Each PDF file represents one
lecture of about 45 minutes.
Chapter 1: sections 2, 3, 4, 5-7
Chapter 2: sections 1-2, 3, 4, 5-6
Chapter 3: sections 1-2, 3, 4, 5, 6, 7
Chapter 4: sections 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Appendix C: Programming in BSP style using MPI, sections 1 (MPI-1) and 2 (MPI-2).
Further reading
- U. Naumann and O. Schenk (eds.), Combinatorial Scientific Computing, CRC Press, 2012.
- J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst, Numerical Linear Algebra for High-Performance Computers, SIAM, 1998.
- G. H. Golub and C. F. Van Loan, Matrix Computations, fourth edition, The Johns Hopkins University Press, 2012.
Some weekly summaries and links
Wednesday September 9, 2015. Room 169 Buys Ballot building
The Bulk Synchronous Parallel model
(PSC section 1.2)
What is a parallel computer? Why do we compute in parallel?
The Bulk Synchronous Parallel (BSP) model by Valiant
comprises an abstract machine architecture, a framework for
developing algorithms,
and a cost function for analysing the run time of algorithms.
The BSP architecture is a set of processor-memory pairs
connected by a black-box communication network.
BSP algorithms consist of computation supersteps
and communication supersteps.
Communication is costed in terms of h-relations.
The BSP cost of an algorithm is expressed
in the machine parameters p, g, and l.
The computing rate is r flop/s.
Motivation of the cost model: bottlenecks at the entrance
and exit of the communication network.
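As a small illustration of how the cost function is used (the numerical values in the usage note below are made up, not machine measurements): a superstep with w flops of local computation and an h-relation of communication costs w + hg + l flop units, and the cost of an algorithm is the sum over its supersteps.

```c
/* BSP cost of one superstep: w flops of local computation, an
   h-relation of communication, and one global synchronisation.
   All costs are measured in flop time units; g (communication cost
   per data word) and l (synchronisation cost) are machine parameters. */
long bsp_superstep_cost(long w, long h, long g, long l) {
    return w + h * g + l;
}

/* The BSP cost of a whole algorithm is the sum of its superstep costs. */
long bsp_total_cost(const long w[], const long h[], int nsupersteps,
                    long g, long l) {
    long total = 0;
    for (int s = 0; s < nsupersteps; s++)
        total += bsp_superstep_cost(w[s], h[s], g, l);
    return total;
}
```

For example, on a hypothetical machine with g = 5 and l = 50, a superstep with w = 100 and h = 10 costs 100 + 50 + 50 = 200 flop units.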
Parallel Inner Product Computation
(PSC section 1.3)
My First Parallel Algorithm: the computation
of the inner product of two vectors.
Possible distributions: cyclic and block.
Data distribution leads naturally to work distribution.
Communication is needed to add local inner products into a global
inner product.
Single Program, Multiple Data (SPMD) approach:
we don't want to write 400 different programs for a machine with
400 processors.
One-sided communication is a great invention
(in parallel computing, at least).
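A minimal C sketch of the local computation superstep of this algorithm, using the cyclic distribution. For brevity each processor is given read access to the full vectors here; in a real SPMD program each processor stores only its own components, and the concluding communication superstep would use BSPlib primitives such as bsp_put.

```c
/* Local part of the parallel inner product for processor s out of p,
   with vectors x and y of global length n in the cyclic distribution:
   processor s owns the components with global index i = s, s+p, s+2p, ...
   The data distribution thus directly determines the work distribution. */
double local_inner_product(const double x[], const double y[],
                           long n, int p, int s) {
    double alpha = 0.0;
    for (long i = s; i < n; i += p)
        alpha += x[i] * y[i];
    /* A communication superstep (e.g. putting alpha into all other
       processors) followed by a local summation of the p partial
       results would complete the parallel algorithm. */
    return alpha;
}
```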
Exercise class
In class: Exercise 1.2.
Please hand in Exercise 1.1 as Homework 1 (HW1) on September 16, 2015.
Interesting links
- Top 500 of supercomputers in the world.
- SURFsara in Amsterdam, home of the Cartesius national supercomputer, which we will use in our laboratory class and homework. It is a Bull machine. Cartesius currently has 40,960 cores, organised in thin and fat nodes. Its peak performance is 1,559 Tflop/s.
Wednesday September 16, 2015.
BSPlib, the Bulk Synchronous Parallel library.
(PSC section 1.4)
Benchmarking
(PSC sections 1.5-1.7)
Interesting links:
Computer Laboratory BYOD
(from 12.00 to 12.45):
starting with BSPlib.
This is your first handson session with BSPlib.
Download the latest version of
BSPedupack,
my package of educational programs that
teaches how to use BSP.
Solve Exercise 1.3: try to run the benchmark,
exploring your parallel environment: your own laptop
and Cartesius. Change puts into gets.
BYOD: If you own a laptop, bring it to class,
so we can install MulticoreBSP for C and check your software.
Wednesday September 23, 2015.
Sequential LU Decomposition
(PSC sections 2.1-2.2)
Parallel LU Decomposition
(PSC section 2.3)
Exercise class
Starting Exercise 1.7 from PSC. Design a parallel algorithm
for prime number generation.
Write your first parallel
program.
This program should generate all prime numbers below 1,000,000 in parallel.
Choosing the right distribution is the name of the game.
Hand in a report on Exercise 1.7 and submit your program to the checking system before the deadline,
Wednesday October 14, 10.00 hour.
Note that you can work in pairs.
The format for your program is specified in the
primes program specification.
Useful other files:
primes_example_0.out,
primes_example_1.out,
primes_example_2.out.
For Goldbach, note that you need to find only one prime pair (p, q) with p+q = 2k
for each even number 2k in order to check the conjecture.
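For reference, a plain sequential sieve of Eratosthenes in C. It can serve as a correctness check for your parallel program; the parallel version must additionally choose a good distribution of the sieve array, which is the heart of the exercise.

```c
#include <stdlib.h>
#include <string.h>

/* Sequential sieve of Eratosthenes for n >= 2: returns a heap-allocated
   array prime[0..n] with prime[k] == 1 if and only if k is prime.
   The caller must free the returned array. */
char *sieve(long n) {
    char *prime = malloc(n + 1);
    if (prime == NULL)
        return NULL;
    memset(prime, 1, n + 1);
    prime[0] = prime[1] = 0;
    for (long k = 2; k * k <= n; k++)
        if (prime[k])
            for (long m = k * k; m <= n; m += k)  /* cross out multiples */
                prime[m] = 0;
    return prime;
}
```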
Interesting links:
- Guideline mathematical writing by the Department of Mathematics, Utrecht University, a must-read for those writing a report for this course.
- Vimtutor, useful if you want to learn some vi editor commands on the Linux system.
Wednesday September 30, 2015.
Two-Phase Broadcasting
(PSC section 2.4)
Experiments with bsplu
(PSC sections 2.5-2.6)
Exercise class
Discussion of Homework 1 (HW1). Answers to Exercise 1.2.
Continue work on the prime number sieve. Discuss your approaches with me.
Wednesday October 7, 2015.
No class because of the
40th Woudschoten Scientific Computing conference.
Wednesday October 14, 2015.
Sequential sparse matrix-vector multiplication
(PSC section 4.1)
Sparse matrices and their data structures
(PSC section 4.2)
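As an illustration, a sequential sparse matrix-vector multiplication u := Av with A stored in Compressed Row Storage (CRS), one of the data structures treated in PSC section 4.2. The array names here are my own choice, not necessarily those of the book.

```c
/* Sequential sparse matrix-vector multiplication u = A*v, with the
   n x n matrix A stored in Compressed Row Storage (CRS): for row i,
   the nonzero values are a[start[i]..start[i+1]-1] and their column
   indices are col[start[i]..start[i+1]-1]. */
void spmv_crs(int n, const long start[], const long col[],
              const double a[], const double v[], double u[]) {
    for (int i = 0; i < n; i++) {
        u[i] = 0.0;
        for (long k = start[i]; k < start[i + 1]; k++)
            u[i] += a[k] * v[col[k]];
    }
}
```

The parallel version must in addition decide how the nonzeros and the vector components are distributed over the processors, which determines the communication.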
Exercise class
Choose a new assignment. Possibilities:
- solve Exercise 1.5 on data compression;
- solve Exercise 2.5 on Cholesky factorisation;
- solve Exercise 2.7 on Strassen matrix multiplication;
- develop a communication-avoiding LU decomposition with multiple rank updates;
- write a parallel program for maximum-cardinality matching in a bipartite graph;
- write a parallel program for counting self-avoiding walks (SAWs) on a 2D or 3D lattice using Epiphany BSP on the Parallella board. (The current world record for SAWs on the 3D cubic lattice is 36 steps.)
Requests for different topics will be considered.
Please try to discuss the chosen topic with the teacher before mid-November.
Start working on Homework 2 (HW2) on Chapter 2: Exercise 2.3(a), on optimising the row swap
in LU decomposition.
Hand it in on October 28.
Wednesday October 21, 2015
Parallel sparse matrix-vector multiplication
(PSC section 4.3)
Cartesian matrix distributions
(PSC section 4.4)
Exercise class
Discussion of final project.
Wednesday October 28, 2015.
Mondriaan sparse matrix distribution
(PSC section 4.5)
11.00-11.45 hour: guest lecture by Daan Pelt (CWI, Amsterdam): New developments in 2D sparse matrix distribution
Slides of the guest lecture
Abstract:
The sparse matrix partitioning problem arises in parallel sparse matrix-vector multiplication, where one needs to distribute the work evenly between the available processors while minimizing the amount of communication needed between processors. Given a sparse matrix A, we aim to find a partitioning of the matrix nonzeros such that each part contains at most a certain number of nonzeros, and the number of different parts that own nonzeros in each row and column is minimized. The problem can be written as a hypergraph partitioning problem, and is NP-hard to solve. Therefore, in practice, several heuristic methods are used to find good solutions for specific instances. One-dimensional heuristic solvers partition the sparse matrix only in the row or column direction and are computationally efficient, but typically produce suboptimal results. On the other hand, two-dimensional heuristic solvers produce better results, but can be prohibitively slow in practice.
In this talk, I will present work by Rob H. Bisseling and me on two methods for solving the sparse matrix partitioning problem. The first method is a heuristic solver [1], based on translating the sparse matrix A to a different sparse matrix B with a specific structure. This structure allows us to obtain very good results by partitioning B with a fast one-dimensional heuristic and translating the resulting partitioning of B to a partitioning of A. The method produces two-dimensional partitionings of A, while retaining the computational efficiency of standard one-dimensional heuristic solvers. This method, called the medium-grain method, is now the default in the Mondriaan sparse matrix partitioning package [2]. The second method is an exact solver [3], i.e. one that finds the optimal partitioning for a given matrix. Since the problem is NP-hard, we accept much longer computation times for an exact solver, and limit the size of the problems we expect to solve. However, analyzing optimal partitionings of sparse matrices might help improve current heuristics. The exact method is based on a branch-and-bound framework, with several lower bounds for partial solutions. We show that the exact method is able to find the optimal solution for the majority of small test matrices, and even for a few larger matrices with special structures. This method will soon become available as the tool MondriaanOpt within the Mondriaan package.
References:
[1] D. M. Pelt and R. H. Bisseling, "A Medium-Grain Method for Fast 2D Bipartitioning of Sparse Matrices," Proceedings IEEE 28th International Parallel and Distributed Processing Symposium 2014, pp. 529-539, 19-23 May 2014.
[2] http://www.staff.science.uu.nl/~bisse101/Mondriaan/
[3] D. M. Pelt and R. H. Bisseling, "An exact algorithm for sparse matrix bipartitioning," Journal of Parallel and Distributed Computing, Vol. 85 (2015), pp. 79-90.
Exercise class
Hand back of graded reports.
These count for 70% of the grade of the first assignment;
the program counts for 30%.
Interesting links:
- Slides on the medium-grain method by Daan Pelt and Rob Bisseling, presented May 2014 at IPDPS'14.
- A medium-grain method for fast 2D bipartitioning of sparse matrices, by D. M. Pelt and R. H. Bisseling, Proceedings IEEE International Parallel & Distributed Processing Symposium 2014, IEEE Press, pp. 529-539.
- MondriaanOpt for optimal sparse matrix bipartitioning, and a database of optimally partitioned matrices for 2 processors, with nice pictures.
- Edge-based graph partitioning: slides of a lecture on medium-grain and optimal partitioning, Sparse Days at St. Girons, France, June 30, 2015.
Wednesday November 4, 2015.
Laplacian matrices
(PSC section 4.8)
Program bspmv and bulk synchronous message passing primitives
(PSC section 4.9)
Interesting links:
- Parallel Search, a Wikipedia page about speeding up search algorithms (such as those employed in computer chess) by parallelisation.
Exercise class
Discussion of the solution of the prime number sieve + feedback on the computer program results.
Announcement of winner of best program competition.
Homework HW3: Exercise 4.2, but with a 12 x 16 grid instead of 12 x 12.
If you use LaTeX, try TikZ to produce a nice figure.
Hand it in on November 25.
Wednesday November 11, 2015.
Guest lecture by Albert-Jan Yzelman (Huawei Research, Paris): Next-generation shared-memory parallel computing
Albert-Jan is the author of MulticoreBSP.
Slides of the guest lecture
Abstract:
Distributed-memory parallel programming has been the main area of focus for the Bulk Synchronous Parallel (BSP) model. This was true in the past, and thanks to contemporary popular frameworks such as MapReduce, Spark, and Pregel, remains true now.
The realm of shared-memory computing, obviously becoming more and more important with the advent of many-core processors, can also benefit from the structured way of parallel programming that BSP imposes.
Using the hard problem of sparse matrix-vector (SpMV) multiplication as an example, relevant both to scientific computing and to graph computing,
this lecture explores the use of BSP in the design of high-performance algorithms. The SpMV multiply is hard because it requires irregular memory access
and suffers from various tendencies in modern architectures: low bandwidth per core and increasingly non-uniform memory hierarchies.
This makes the use of unbuffered communication mandatory, requires careful data placement, and perhaps even requires the adaptation of the flat BSP model into a hierarchical and memory-aware bridging model.
The BSP approach is compared to other paradigms in parallel computing, such as parallel-for, fork-join, and generic fine-grained computing. The design of algorithms in the hierarchical MultiBSP model is illustrated as well.
Exercise class
You can use this opportunity to ask questions and give feedback on MulticoreBSP.
Also, return of HW2.
Wednesday November 18, 2015. BBG 169
Exercise class (10.00-10.45 hour)
Discussion of HW2. Questions on final project.
Please register your choice and discuss with me if needed.
Lecture by Abe Wits: Epiphany BSP (11.00-11.45 hour)
Developed by Abe Wits, JanWillem Buurlage, and Tom Bannink.
Slides of the lecture
Abstract:
The Epiphany chip is a 16-core chip with only 32 kB of RAM per core, and incomplete documentation. It is also the world's most energy-efficient processor! How to write parallel C code for such a platform? What if you want to reuse your code on a supercomputer and your laptop? We will discuss one possible solution: BSP, a parallel programming model. Learn about programming on the Parallella chip, the Epiphany architecture, BSP, streaming data structures, and more!
With two friends, I developed Epiphany BSP; check http://codu.in/ebsp/ for more information.
The Parallella is like a $100, overpowered, 18-core Raspberry Pi.
Exercise class (12.00-12.45 hour)
Question hour.
Interesting links:
Wednesday November 25, 2015.
Sequential graph matching
(new material)
Parallel graph matching
(new material)
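As a sequential point of reference for this new material, here is the standard greedy matching heuristic in C; it produces a maximal matching with at least half the cardinality of a maximum matching. It is not necessarily the algorithm of the lecture, just a common baseline.

```c
/* Sequential greedy matching on an undirected graph with n vertices,
   given as an edge list (u[e], v[e]) for e = 0..nedges-1: match each
   edge whose two endpoints are both still unmatched. On return,
   match[w] is the vertex matched to w, or -1 if w is unmatched.
   Returns the cardinality (number of matched edges). */
long greedy_matching(int n, long nedges, const int u[], const int v[],
                     int match[]) {
    for (int w = 0; w < n; w++)
        match[w] = -1;
    long card = 0;
    for (long e = 0; e < nedges; e++) {
        if (u[e] != v[e] && match[u[e]] == -1 && match[v[e]] == -1) {
            match[u[e]] = v[e];   /* match this edge */
            match[v[e]] = u[e];
            card++;
        }
    }
    return card;
}
```

The interesting question for the parallel version is how to resolve conflicts when two processors try to match edges sharing a vertex.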
Exercise class
Question hour. If you have not registered your final presentation yet, please do so as soon as possible.
Interesting links:
Wednesday December 2, 2015.
Programming in BSP style using MPI-1 (Appendix C), 10.00-11.00 hour
2 presentations of the final project, 11.15-11.45 hour
Exercise class, 12.00-12.45 hour
Question hour.
Interesting links:
- Wired article on the twin primes problem, 22 November 2013. The twin prime gap bound is down from 70 million to 600, closing in on 2.
- Sinterklaasgedicht by Wiebe Haanstra (in Dutch), a poem about the frustrations of parallel programming.
Wednesday December 9, 2015.
9 presentations of the final project
Exercise class
Return of HW3.
Wednesday December 16, 2015.
10 presentations of the final project
Exercise class
Course evaluation.
Please fill out the
Mastermath digital evaluation form. Use last year's form if needed!
Deadline for the second assignment: Monday January 18, 2016, 12.00 hour.
Hand in a report before the deadline.
Please use the batch system of Cartesius, and not the interactive system,
for your test runs. The interactive system is only for development.
Frequently asked questions
Other courses that teach the BSP model (among others)
Last update of this page: November 25, 2015