Parallel Algorithms (WISM 459), 2013/2014
Class starts on Wednesday September 18, 2013 at 10.15 hour
in Utrecht. Buys Ballot building Room 017,
Princetonplein 5, Uithof, Utrecht.
Note the change of location!
Teacher
Rob Bisseling
Time and place
Every Wednesday from 10.1513.00 hour at
Utrecht University, campus De Uithof.
Location of lectures: either room 611 Hans Freudenthal building,
or rooms 017, and 020 in the Buys Ballot building.
First lecture: September 18, 2013.
Last lecture: December 18, 2013. Please note: this
is the now the correct date (see the Mastermath site). It was
incorrectly written on this site as December 11.
Each session consists of 2 times 45 min. lectures and one exercise class
(or computer laboratory class, or question session) of 45 min.
Note that class always starts at 10.15 hour,
because it is part of Mastermath.
Intended audience
This is a Master's course which is an elective course for students
of the Master's programme Mathematical Sciences
of Utrecht University, part of the Scientific Computing specialisation.
The course is recommended for students in Mathematics or Computer Science
who are interested in scientific computing
and would like to obtain a first `handson experience'
in parallel programming.
It is also highly suitable for the Honours master
programme in Theoretical Physics and Mathematical Sciences
and for students interested in Computational Physics.
If you consider taking this course as part of a Bachelor's degree,
please contact the teacher first about the prerequisites.
The course is part of the Dutch National Master
Mastermath and
hence is open for all interested master students in Mathematics
in the Netherlands.
Registration for everybody is through Mastermath.
Course aim
Students will learn how to design a parallel algorithm for a
problem from the area of scientific computing
and how to write a parallel program that
solves the problem.
Learning goals
After completion of the course, the student is able to

design a parallel algorithm for a problem from the area of scientific computing;

analyse the cost of the algorithm in terms of computing time, communication time, and synchronisation time;
 write a parallel program based on an algorithm that solves the problem;
 write a report on the algorithm, its analysis, and numerical experiments performed, in a concise and readable form.
Credits
You get 8 ECTS credit points.
Contents
Today, parallel computers are appearing on our desktops.
The advent of dualcore and quadcore computers and the expected increase
in the number of cores in the coming years, inevitably will cause a major change
in our approach to software, such as the software we use in scientific computations.
Parallel computers drastically increase our computational capabilities
and thus enable us to model more realistically in many application areas.
To make efficient use of parallel computers, it is necessary to reorganise
the structure of our computational methods.
In particular, attention must be paid to the division of work
among the different processors solving a problem in parallel
and to the communication between them. Suitable parallel algorithms and systems
software are needed to realise the capabilities of parallel computers.
We will discuss extensively the most recent developments in the area
of parallel computers, ranging from multicore desktop PCs,
to clusters of PCs connected by switching devices,
to massively parallel computers with distributed memory
such as our national supercomputer Cartesius at SARA in Amsterdam.
The following subjects will be treated:
 Types of existing parallel computers
 Principles of parallel computation: distributing the work evenly
and avoiding communication
 The Bulk Synchronous Parallel (BSP) model as an idealised model
of a parallel computer
 Use of BSPlib software as the basis for architectureindependent
programs

Parallel algorithms for the following problems:
prime number sieving, LU decomposition,
Fast Fourier Transform,
sparse matrixvector multiplication, iterative solution of
linear systems, graph matching.

Analysis of the computation, communication, and synchronisation
time requirements of these algorithms by the BSP model.
Language
The course will be given in English.
All reading material will be in English.
Prerequisites
Introductory course in linear algebra.
Some knowledge of algorithms and programming is helpful.
The laboratory classes will use the programming language C.
If you don't know C, this may be an opportunity to learn it.
You can also use C++, if you already know that language.
We will make use of the recently released
MulticoreBSP for C software developed by AlbertJan Yzelman.
This sofware runs on sharedmemory multicore PCs, and you can also run
your program without any change on distributedmemory machines
such as Cartesius.
Examination
The examination is in the form of two assignments during the course,
each with a weight of 45%, and a small written examination on the first
chapter of the book
with a weight of 10%. The small examination must be passed with grade 6
(out of 10) or higher.
The two assignments are carried out in the
exercise/laboratory class and at home.
A written report must be returned before the deadline.
Students can work individually or in pairs (but not in larger teams)
on the computer programs and can hand in a single report,
provided each did a fair share of the work
and can account for that.
Students from Utrecht must submit in hardcopy paper format and you
must use LaTeX.
Students from outside Utrecht University may
also submit reports electronically in PDF format by email,
but hardcopy is still preferred.
Literature
We closely follow the book
Parallel
Scientific Computation: A Structured Approach using BSP and MPI (PSC),
by Rob H. Bisseling,
Oxford University Press,
March 2004. ISBN 0198529392.
Please note that the book is now
available by the printingondemand system of Oxford University Press,
at the book's OUP website.
Furthermore, our student union Aeskwadraat may still have a limited number of copies.
I am currently working on a second edition,
and some additional material will be given to you
(and will be tested on you!).
The first chapter of the book is freely available from the
publisher's website,
see "Sample material".
In addition, all my transparancies and WWW links to
background material are provided through this course page.
(Pointers to other relevant links are welcome!)
Transparancies
LaTeX sources (in Beamer) and PDFs of all my transparencies
(14 Mbyte in gzipped tar format).
The sources may be of help to other teachers
who want to teach from my book.
Students may find them helpful too.
Last update of the files: September 26, 2012.
The transparancies are complete
and they cover every section of the book.
They have been converted to
the Beamer macros in September 2012, and have been made up to date in the process.
The old version (from 2007) in Prosper can be found
here
You can also obtain the PDFs separately. Each PDF file represents one
lecture of about 45 minutes.
Chapter 1: sections
2
3
4
57
Chapter 2: sections
12
3
4
56
Chapter 3: sections
12
3
4
5
6
7
Chapter 4: sections
1
2
3
4
5
6
7
8
9
10
Appendix C:
Programming in BSP style using MPI,
1 (MPI1)
2 (MPI2).
Further reading

U. Naumann and O. Schenk (eds), Combinatorial Scientific Computing,
CRC Press, 2012.

J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst,
Numerical Linear Algebra for HighPerformance Computers,
SIAM, 1998.

G. H. Golub and C. F. van Loan, ``Matrix Computations",
Fourth edition, The Johns Hopkins University Press, 2012.

C. F. van Loan, ``Computational Frameworks for the Fast Fourier Transform",
SIAM, 1992.
Some weekly summaries and links
Wednesday September 18, 2013. Room 017 Buys Ballot building
The Bulk Synchronous Parallel model
(PSC section 1.2)
What is a parallel computer? Why do we compute in parallel?
The Bulk Synchronous Parallel (BSP) model by Valiant
comprises an abstract machine architecture, a framework for
developing algorithms,
and a cost function for analysing the run time of algorithms.
The BSP architecture is a set of processormemory pairs
connected by a black box communication network.
BSP algorithms consist of computation supersteps
and communication supersteps.
Communication is costed as hrelations.
The BSP cost of an algorithm is expressed
in machine parameters p, g, l.
The computing rate is r flop/s.
Motivation of the cost model: bottlenecks at the entrance
and exit of the communication network.
Parallel Inner Product Computation
(PSC section 1.3)
My First Parallel Algorithm: the computation
of the inner product of two vectors.
Possible distributions: cyclic and block.
Data distribution leads naturally to work distribution.
Communication is needed to add local inner products into a global
inner product.
Single Program, Multiple Data (SPMD) approach:
we don't want to write 400 different programs for a machine with
400 processors.
Onesided communication is a great invention
(in parallel computing, at least;).
Computer Laboratory in adjacent rooms BBG 017 and 020
(from 12.0013.00 hour):
starting with BSPlib.
This is your first handson session with BSPlib. You may have to set up your accounts.
Download the latest version of
BSPedupack ,
my package of educational programs that
teaches how to use BSP.
Solve Exercise 1.3: try to run the benchmark,
exploring your parallel environment: your own laptop
and Cartesius. Change puts into gets.
If you own a laptop, bring it to class,
so we can install MulticoreBSP for C and check your software
Interesting links
Wednesday September 25, 2013. Room 611 Hans Freudenthal building
BSPlib, the Bulk Synchronous Parallel library
(PSC section 1.4)
Benchmarking
(PSC sections 1.51.7)
Interesting links:
Exercise class
Exercise 1.2 from the book.
Wednesday October 2, 2013. Room 611 Hans Freudenthal building
Sequential LU Decomposition
(PSC sections 2.12.2)
Parallel LU Decomposition
(PSC section 2.3)
Exercise class
Answers to Exercise 1.2.
Starting Exercise 1.7 from PSC. Design a parallel algorithm
for prime number generation.
Write your first parallel
program.
This program should generate all prime numbers below 1,000,000 in parallel.
Choosing the right distribution is the name of the game.
Hand in a report on exercise 1.7 before the deadline,
Wednesday October 23, 10.15 hour.
Interesting links:
Wednesday October 9, 2013. Room 611 Hans Freudenthal building
Small examination (10.1511.15 hour)
Test of 60 minutes on Chapter 1 of PSC. It is a closedbook examination!
Midterm Exam Parallel Algorithms 2012
Repeat Midterm Exam Parallel Algorithms 2012
Midterm Exam Parallel Algorithms 2013
Repeat Midterm Exam Parallel Algorithms 2013
TwoPhase Broadcasting
(PSC section 2.4)
Experiments with bsplu
(PSC sections 2.52.6)
Wednesday October 16, 2013. Room BBG 017
Sequential Recursive Fast Fourier Transform (FFT)
(PSC sections 3.13.2) A wonderful algorithm.
Computer Laboratory in adjacent rooms BBG 017 and 020
in the Buys Ballot Building (from 11.1513.00 hour)
Continue with solving Exercise 1.7:
write a parallel program for generating prime numbers.
Run it on Cartesius. Get familiar with the interactive and batch mode.
 Vimtutor,
found to be useful if you want to learn some vi editor commands
on the Linux system.
Wednesday October 23, 2013. Room 611 Hans Freudenthal building
Sequential Nonrecursive FFT
(PSC section 3.3)
Parallel Fast Fourier Transform
(PSC section 3.4)
Exercise class
Choose a new assignment. Possibilities:
solve Exercise 2.5 on Cholesky factorisation;
solve Exercise 2.7 on Strassen matrix multiplication;
solve Exercise 3.9 on computing π in parallel
(the current world record
by Kondo and Yee is 10 trillion digits);
write a parallel program for counting selfavoiding walks (SAWs) on a 2D or 3D lattice.
(The current
world record
for SAWs on the 3D cubic lattice is 36 steps.)
Requests for different topics will be considered.
Please try to discuss the chosen topic with teacher before midNovember.
Wednesday October 30, 2013. Room 611 Hans Freudenthal building
Repeat exam (10.1511.15 hour)
Repeat test of 60 minutes on Chapter 1. Only if you failed
the first examination or want to improve your grade.
(The highest grade of the two counts.)
Weight Reduction (from 11.15 hour onwards)
(PSC section 3.5)
Program bspfft
(PSC section 3.6)
Wednesday November 6, 2013. Room 611 Hans Freudenthal building
Guest lecture AlbertJan Yzelman (KU Leuven) from 10.1512.00 hour. Highperformance sparse matrix–vector multiplication
on sharedmemory architectures
Discussion of various sharedmemory parallel programming interfaces,
such as Cilk, OpenMP, Pthreads, and MulticoreBSP for C.
Using the SpMV multiplication as an example,
we will exploit the specific features of sharedmemory architectures to obtain stateoftheart performance.
Slides of the lecture
AlbertJan Yzelman is designer of the MulticoreBSP for C library
that we are using.
Exercise class (12.0013.00 hour)
Work on your new assignment. Ask anything you want to know about MulticoreBSP for C (or Java).
Wednesday November 13, 2013. Room 611 Hans Freudenthal building
Sequential sparse matrixvector multiplication
(PSC section 4.1)
Sparse matrices and their data structures
(PSC section 4.2)
Exercise class
Work on your new assignment.
Wednesday November 20, 2013. Room 611 Hans Freudenthal building
Parallel sparse matrixvector multiplication
(PSC section 4.3)
Cartesian matrix distributions
(PSC section 4.4)
No exercise class today
(Because of meeting Mathematics and Medicine in UMC.)
Interesting links:
Wednesday November 27, 2013. Room 611 Hans Freudenthal building
Mondriaan sparse matrix distribution
(PSC section 4.5)
New developments in 2D sparse matrix distribution
Movies of matrix partitioning, improvements in quality and speed
by the mediumgrain method (Mondriaan 4.0).
Exercise class
Discussion of prime number sieve, twin primes, Goldbach.
Work on your new assignment.
Interesting links:

Wired Article on twin primes problem, 22 November 2013.
Twin gap down from 70 million to 600, closing in on 2.
Wednesday December 4, 2013. Room BBG 017
Vector distribution
(PSC section 4.6)
Laplacian matrices
(PSC section 4.8)
Exercise/Laboratory class
Exercise 4.2, finding the optimal distribution over 8 processors
for a 12 by 12 grid. Test the program of your second assignment.
Interesting links:

Sinterklaasgedicht door Wiebe Haanstra (in Dutch),
poem about the frustrations of parallel programming
Wednesday December 11, 2013. Room BBG 017
Parallel graph matching
Computer Laboratory from 11.1513.00 hour.
Please fill out the
Mastermath digital evaluation form FALL 2013
Deadline second assignment January 20, 2014, 17.00 hour.
Hand in a report before the deadline.
Note that it has been extended by a week!
Please use the batch system of Cartesius and not the interactive system
for your test runs. The interactive system is only for development.
Second assignment has been graded now (February 20, 2014).
Please contact me to get your grade and discuss your work with me.
Interesting links:
Wednesday December 18, 2011. Room 611 Hans Freudenthal building.
Message Passing Interface (MPI1)
(PSC sections C.1, C.2.1C.2.4)
Message Passing Interface (MPI2)
(PSC Section C2.5, C3, C4)
Other courses that teach the BSP model (among others)

Parallel computing (BKULH03F9A)
at KU Leuven, Belgium, by AlbertJan Yzelman.

Efficient Parallel Algorithms (CS329)
at University of Warwick, UK,
by Alex Tiskin.

Parallel and Distributed Programming (CS236370)
by Assaf Schuster,
at the Technion, Haifa, Israel.

High Performance and Parallel Computing (22C177)
by Suely Oliveira,
at the University of Iowa.

Parallel computing (in Polish)
by Przemek Stpiczynski at the Maria CurieSklodowska University,
Lublin, Poland.
Last update of this page: February 20, 2014
Back to homepage of Rob Bisseling