Multiprocessors I & II
======================

Lecture 21, 4/24/2007

Auxiliary materials:
- Lecture 20 slide set w/ processor die plots
- Chapter 9 (from CD) of textbook, esp. pg. 9-12 and 9-20
- Hot Chips 2006 slides about Niagara

1) [previous class] Went through slides:
   - mostly examples of multiprocessor machines of different sizes and
     organizations:
     . multi-chip multiprocessors
     . single-chip multiprocessors
   *** We're going to see lots of these in the future, now that we can fit
   many processors on a die and instruction-level parallelism (ILP) is
   reaching its limits.

2) What are the limits to single-processor performance?
   (for now, define a "single processor" as one program counter)

   This raises the question: how do we improve the performance of
   uniprocessors?
   - pipelining
     but eventually branch mispredictions and latch overheads kill you
   - multi-issue machines
     but eventually you run out of parallel work to do within a small
     instruction window, limited by branch mispredictions and cache misses
   More fundamentally: there are limits to fine-grained instruction-level
   parallelism.

3) The reasons for parallel processing have changed.
   Traditionally:
   - you want more performance than you can get from a single-chip
     processor, e.g. a cluster of PCs
   Now:
   - you want maximum performance from *one* chip
   - parallelism is the most effective way to get it

4) Some kinds of multiprocessing:
   - Diagram of SIMD machine (shared PC and instruction decode)
   - Diagram of MIMD machine (PC and decode for each processor)
   - Contrast with ILP (one PC, but each instruction executes on only one
     piece of data, unlike a SIMD machine)

   [Use Niagara 2 slides from Hot Chips 2006 to illustrate a multi-core,
   *multi-threaded* machine]

   Draw figure showing the MIPS 5-stage pipeline, but with 4 program
   counters feeding the pipeline, and four register files.

5) Examples of uses:
   - ATM transaction-processing system
   - Web server
   - Graphics processors
   - OS + application #1 + application #2

6) Writing parallel programs can be hard...
   - static-page web serving -- easy
   - parallelizing the final project assignment (cache simulator) -- hard

7) Granularity of parallelism:

   Granularity         Example
   -----------         -------
   bit                 32-bit adder
   instruction (ILP)   superscalar or out-of-order processor
   thread              data parallelism -- update the color of every pixel
   task                MS Word + web browser running on a two-processor machine

8) Limits to parallelism:
   - Amdahl's Law... What percentage of the program is parallelizable?
     What if I perfectly parallelized that? If a fraction p of the work is
     parallelizable across n processors, speedup = 1 / ((1 - p) + p/n);
     even as n -> infinity, speedup is capped at 1 / (1 - p).

9) Communication models

   Asked question: how do two programs communicate?

   Message passing:
   - e.g. network sockets
   - e.g. a special API like MPI

   Shared memory:
   - Showed how page tables could implement this in the operating system,
     allowing two processes to share memory
   - Briefly mentioned issues with cache coherence
   - non-uniform memory access (NUMA) vs. uniform memory access (UMA)

Example of a parallel program
-----------------------------
Sum of an array of numbers, pg. 9-11 from CD
. show a naive implementation (lots of false sharing)
. show a better implementation
. show graphically how it works
(sketches of both versions appear after the synchronization list below)

Some synchronization models
---------------------------
* Barrier -- everyone waits until all arrive, then everyone continues
* Lock -- exclusive access (mutual exclusion)
* Producer/consumer -- pairwise ordering
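A minimal sketch of all three models using POSIX threads (my own pairing of
primitives, not from the lecture; compile with -pthread): a mutex for the
lock, pthread_barrier_wait for the barrier, and a semaphore for the
producer/consumer ordering.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    #define NTHREADS 4

    pthread_barrier_t barrier;                         /* barrier */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* lock */
    sem_t full;                                        /* producer/consumer */

    int counter = 0;   /* protected by lock */
    int mailbox;       /* passed producer -> consumer */

    void *worker(void *arg)
    {
        /* Lock: mutual exclusion around the shared counter. */
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);

        /* Barrier: nobody proceeds until all NTHREADS threads arrive. */
        pthread_barrier_wait(&barrier);

        /* Producer/consumer: thread 0 produces, thread 1 consumes. */
        if ((long)arg == 0) { mailbox = 42; sem_post(&full); }
        if ((long)arg == 1) { sem_wait(&full); printf("got %d\n", mailbox); }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);
        sem_init(&full, 0, 0);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %d\n", counter);   /* always NTHREADS */
        return 0;
    }

Note the pairwise ordering: sem_wait blocks until sem_post has happened, so
the consumer is guaranteed to see the value the producer wrote.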
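And a sketch of the array-sum example from the section above, in the spirit
of pg. 9-11 but with my own names and sizes (4 threads, 64-byte cache lines
assumed). The naive version keeps per-thread partial sums in adjacent words,
so they all live in one cache line and every update invalidates the other
processors' copies: false sharing. The better version accumulates in a local
variable and pads each slot out to a full line.

    #include <pthread.h>
    #include <stdio.h>

    #define N        4096
    #define NTHREADS 4

    int array[N];

    /* Naive: adjacent partial sums share a cache line -- false sharing. */
    long partial_naive[NTHREADS];

    /* Better: pad each slot to its own (assumed 64-byte) cache line. */
    struct { long sum; char pad[64 - sizeof(long)]; } partial_padded[NTHREADS];

    void *sum_naive(void *arg)
    {
        long id = (long)arg;
        for (int i = id; i < N; i += NTHREADS)
            partial_naive[id] += array[i];  /* ping-pongs the shared line */
        return NULL;
    }

    void *sum_padded(void *arg)
    {
        long id = (long)arg;
        long local = 0;                     /* accumulate in a register... */
        for (int i = id; i < N; i += NTHREADS)
            local += array[i];
        partial_padded[id].sum = local;     /* ...and write memory once */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        long total = 0;
        for (int i = 0; i < N; i++) array[i] = 1;
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, sum_padded, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);       /* join acts as a barrier here */
        for (int i = 0; i < NTHREADS; i++)
            total += partial_padded[i].sum;
        printf("total = %ld\n", total);     /* prints 4096 */
        return 0;
    }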
Bus-based cache coherence
-------------------------
* Figure of a bus-based multiprocessor
* Talk about what happens if writes happen in multiple caches -- bad!
* Use the bus to coordinate writes
* Two options:
  - write-update -- all writes go to the bus
  - write-invalidate -- the writing processor must acquire an exclusive
    copy prior to writing
    Conserves bus bandwidth, just like a write-allocate cache
* MESI protocol:
    M = modified
    E = exclusive
    S = shared
    I = invalid
* Go through an example of one cache line and its state in each cache:
  - first it's written by one processor
    (that cache holds the only copy: Modified; everyone else: Invalid)
  - then it's read by lots of processors
    (all copies end up Shared)
  - then it's written by a different processor
    (that copy becomes Modified; all others are Invalidated)
  (a toy state trace appears at the end of these notes)
* Multiprocessors also need support for atomic memory operations:
  e.g. test-and-set -- test if 0 and, if so, atomically set it to 1.

Synchronization
---------------
Atomic pair, pg. 9-20 from CD:
. Load locked
. Store conditional
(the store conditional succeeds only if no other processor wrote the
location since the load locked; on failure, retry the pair)
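A sketch of how the atomic pair gets used in practice, written in C with
GCC's __sync builtins rather than MIPS assembly (on MIPS the builtin
compiles down to an ll/sc retry loop like the one on pg. 9-20; the variable
names are mine): a spinlock built from atomic test-and-set.

    #include <pthread.h>
    #include <stdio.h>

    static volatile int lock = 0;   /* 0 = free, 1 = held */
    static long counter = 0;

    static void spin_lock(volatile int *l)
    {
        /* Atomically write 1 and get the old value back.  If the old
           value was 1, someone else holds the lock -- spin and retry. */
        while (__sync_lock_test_and_set(l, 1))
            ;  /* spin */
    }

    static void spin_unlock(volatile int *l)
    {
        __sync_lock_release(l);     /* atomically write 0 */
    }

    static void *worker(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            spin_lock(&lock);
            counter++;              /* critical section */
            spin_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("%ld\n", counter);   /* 200000 -- no lost updates */
        return 0;
    }

Without the atomic test-and-set, two processors could both read the lock as
0 and both enter the critical section; the ll/sc pair is exactly what closes
that window.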
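Finally, returning to the MESI walkthrough above: a toy state trace (my own
simplified model, ignoring writebacks and data transfer) that prints the
state of one line in each of four caches as the lecture's access sequence
happens under a write-invalidate protocol.

    #include <stdio.h>

    #define NCACHES 4
    typedef enum { M, E, S, I } mesi_t;
    static const char *name[] = { "M", "E", "S", "I" };
    static mesi_t line[NCACHES];

    /* Write by cache w: writer goes to M, every other copy invalidated. */
    static void write_by(int w)
    {
        for (int c = 0; c < NCACHES; c++)
            line[c] = (c == w) ? M : I;
    }

    /* Read by cache r: any valid copy elsewhere downgrades to S;
       r gets S if there are sharers, E if it is alone. */
    static void read_by(int r)
    {
        int sharers = 0;
        for (int c = 0; c < NCACHES; c++)
            if (c != r && line[c] != I) { line[c] = S; sharers++; }
        line[r] = sharers ? S : (line[r] == I ? E : line[r]);
    }

    static void show(const char *ev)
    {
        printf("%-12s", ev);
        for (int c = 0; c < NCACHES; c++) printf(" %s", name[line[c]]);
        printf("\n");
    }

    int main(void)
    {
        for (int c = 0; c < NCACHES; c++) line[c] = I;
        show("initial");                  /* I I I I */
        write_by(0);  show("P0 writes");  /* M I I I */
        read_by(1);   show("P1 reads");   /* S S I I */
        read_by(2);   show("P2 reads");   /* S S S I */
        write_by(3);  show("P3 writes");  /* I I I M */
        return 0;
    }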