Today, parallel computers are appearing on our desktops. The advent of dual-core and quad-core computers and the expected increase in the number of cores in the coming years, inevitably will cause a major change in our approach to software, such as the software we use in scientific computations. Parallel computers drastically increase our computational capabilities and thus enable us to model more realistically in many application areas.
To make efficient use of parallel computers, it is necessary to reorganise the structure of our computational methods. In particular, attention must be paid to the division of work among the different processors solving a problem in parallel and to the communication between them. Suitable parallel algorithms and systems software are needed to realise the capabilities of parallel computers.
We will discuss extensively the most recent developments in the area of parallel computers, ranging from multi-core desktop PCs, to clusters of PCs connected by switching devices, to massively parallel computers with distributed memory such as our national supercomputer Cartesius at SurfSARA in Amsterdam.
The following subjects will be treated:
Recommended: Bring your own device! In certain weeks, marked below by BYOD, it is helpful to bring your own laptop, if you possess one, for use during computer laboratory sessions. Please install the MulticoreBSP library. On Macs and Linux computers this is straightforward. On Windows machines you need a UNIX emulator which runs Pthreads, and it is more difficult to get the software running. If you do not possess a laptop, perhaps your project partner does (you are allowed to work in pairs). If this fails, we will find another solution for you.
You will get access to the national supercomputer Cartesius.
All students from Utrecht should submit reports electronically in PDF format by email, to the teacher, Rob Bisseling, with a cc to the teaching assistant, Abe Wits. No hardcopy is needed. All homework must be handed in as hardcopy though. All students must use LaTeX for the assignments; handwritten is OK for the homework. For the first assignment, you will have to submit your parallel program to an automated testing system, which we will become available early October. An email with instructions how to do this has been sent to you.
The first chapter of the book is freely available from the publisher's website, see "Sample material".
In addition, all my slides and WWW links to background material are provided through this course page. (Pointers to other relevant links are welcome!)
Last update of the files: November 6, 2014. The slides are complete and they cover every section of the book. They have been converted to the Beamer macros in September 2012, and have been made up to date in the process. I am adding experimental results on recent computer architectures as we go. The old version (from 2007) in Prosper can be found here.
You can also obtain the PDFs separately. Each PDF file represents one lecture of about 45 minutes.
Chapter 1: sections
Chapter 2: sections 1-2 3 4 5-6
Chapter 3: sections 1-2 3 4 5 6 7
Chapter 4: sections 1 2 3 4 5 6 7 8 9 10
Appendix C: Programming in BSP style using MPI, 1 (MPI-1) 2 (MPI-2).
(PSC section 1.2) What is a parallel computer? Why do we compute in parallel? The Bulk Synchronous Parallel (BSP) model by Valiant comprises an abstract machine architecture, a framework for developing algorithms, and a cost function for analysing the run time of algorithms. The BSP architecture is a set of processor-memory pairs connected by a black box communication network. BSP algorithms consist of computation supersteps and communication supersteps. Communication is costed as h-relations. The BSP cost of an algorithm is expressed in machine parameters p, g, l. The computing rate is r flop/s. Motivation of the cost model: bottlenecks at the entrance and exit of the communication network.
(PSC section 1.3) My First Parallel Algorithm: the computation of the inner product of two vectors. Possible distributions: cyclic and block. Data distribution leads naturally to work distribution. Communication is needed to add local inner products into a global inner product. Single Program, Multiple Data (SPMD) approach: we don't want to write 400 different programs for a machine with 400 processors. One-sided communication is a great invention (in parallel computing, at least;).
In class: Exercise 1.2. Please hand in Exercise 1.1 as Homework 1 (HW1) on September 16, 2015.
(PSC section 1.4)
(PSC sections 1.5-1.7)
(PSC sections 2.1-2.2)
(PSC section 2.3)
Hand in a report on exercise 1.7 and submit your program to the checking system, before the deadline, Wednesday October 14, 10.00 hour. Note that you can work in pairs. The format for your program is specified in the primes program specification. Useful other files: primes_example_0.out, primes_example_1.out, primes_example_2.out, For Goldbach, note that of course you need to find only one prime pair (p,q) with p+q=2k for every even number 2k to check the conjecture.
(PSC section 2.4)
(PSC sections 2.5-2.6)
(PSC section 4.1)
(PSC section 4.2)
Choose a new assignment. Possibilities: solve Exercise 1.5 on data compression; solve Exercise 2.5 on Cholesky factorisation; solve Exercise 2.7 on Strassen matrix multiplication; develop a communication-avoiding LU decomposition with multiple -rank updates; write a parallel program for maximum-cardinality matching in a bipartite graph; write a parallel program for counting self-avoiding walks (SAWs) on a 2D or 3D lattice using Epiphany BSP on the Parallella board (The current world record for SAWs on the 3D cubic lattice is 36 steps.) Requests for different topics will be considered. Please try to discuss the chosen topic with the teacher before mid-November.
Start working on Homework 2 (HW2) on Chapter 2, Exercise 2.3(a) on optimising the row swap in LU decomposition. Hand it in on October 28.
(PSC section 4.3)
(PSC section 4.4)
(PSC section 4.5)
Slides of the guest lecture
Abstract: The sparse matrix partitioning problem arises in parallel sparse matrix-vector multiplications, where one needs to distribute work evenly between the available processors while minimizing the amount of needed communication between processors. Given a sparse matrix A, we aim to find a partitioning of the matrix nonzeros, such that each part contains at most a certain number of nonzeros, and the number of different parts that own nonzeros in each row and column is minimized. The problem can be written as a hypergraph partitioning problem, and is NP-hard to solve. Therefore, in practice, several heuristic methods are used to find good solutions for specific instances. One-dimensional heuristic solvers partition the sparse matrix only in the row or column direction and are computationally efficient, but typically produce sub-optimal results. On the other hand, two-dimensional heuristic solvers produce better results, but can be prohibitively slow in practice.
In this talk, I will present work of Rob H. Bisseling and me on two methods for solving the sparse matrix partitioning problem. The first method is a heuristic solver , based on translating the sparse matrix A to a different sparse matrix B with a specific structure. This structure allows us to obtain very good results by partitioning B with a fast one-dimensional heuristic and translating the resulting partitioning of B to a partitioning of A. The method produces two-dimensional partitionings of A, while retaining the computational efficiency of standard one-dimensional heuristic solvers. This method, called the medium-grain method, is now the default in the Mondriaan sparse matrix partitioning package . The second method is an exact solver , i.e. one that finds the optimal partitioning for a given matrix. Since the problem is NP-hard, we accept much longer computation times for an exact solver, and limit the size of the problems we expect to solve. However, analyzing optimal partitionings of sparse matrices might help improve current heuristics. The exact method is based on a branch-and-bound framework, with several lower bounds for partial solutions. We show that the exact method is able to find the optimal solution for the majority of small test matrices, and even for a few larger matrices with special structures. This method will soon become available as the tool MondriaanOpt within the Mondriaan package.References:
(PSC section 4.9)
Discussion of solution prime number sieve + feedback on computer program results. Announcement of winner of best program competition. Homework HW3: Exercise 4.2, but with a 12 x 16 grid instead of 12 by 12. If you use LaTeX, try Tikz to produce a nice figure. Hand in November 25.
Slides of the guest lecture
Abstract: Distributed-memory parallel programming has been the main area of focus for the Bulk Synchronous Parallel (BSP) model. This was true in the past, and thanks to contemporary popular frameworks such as MapReduce, Spark, and Pregel, remains true now. The realm of shared-memory computing, obviously becoming more and more important with the advent of manycore processors, can also benefit from the structured way of parallel programming that BSP imposes.
Using the hard problem of sparse matrix--vector (SpMV) multiplication as an example both relevant to scientific computing as well as to graph computing, this lecture explores the use of BSP in the design of high-performance algorithms. The SpMV multiply is hard because it requires irregular memory access and suffers from various tendencies in modern architectures: low bandwidth-per-core, increasingly non-uniform memory hierarchies. This makes use of unbuffered communication mandatory, requires careful data placement, and perhaps even requires the adaptation of the flat BSP model into a hierarchical and memory-aware bridging model.
The BSP approach is compared to other paradigms in parallel computing, such as parallel-for, fork-join, and generic fine-grained computing. The design of algorithms in the hierarchical Multi-BSP model is illustrated as well.
Slides of the lecture
Abstract: The Epiphany chip is a 16 core chip with only 32 kb RAM per core, and incomplet e documentation. It is also the world's most energy efficient processor! How to write parallel C code for such a platform? What if you want to reuse your code on a supercomputer and your laptop? We will discuss one possible solution: BSP, a parallel programming model. Learn about programming on the Parallella chip, the Epiphany architecture, BSP, streaming data structures and more!
With two friends, I developed Epiphany BSP, check http://codu.in/ebsp/ for more information.
The Parallella is like a 100$, overpowered, 18 core raspberry pi.
Please fill out the Mastermath digital evaluation form. Use last year's form if needed!