Multiprocessors I & II
======================

Lecture 21, 4/24/2007

Auxiliary materials:
- Lecture 20 slide set w/ processor die plots
- Chapter 9 (from CD) of textbook, esp. pg. 9-12 and 9-20
- Hot Chips 2006 slides about Niagara

1) [previous class] Went through slides:
   - mostly examples of multiprocessor machines of different sizes and
     organizations:
     . multi-chip multiprocessors
     . single-chip multiprocessors
   *** We're going to see lots of these in the future, now that we can fit
   many processors on a die and instruction-level parallelism (ILP) is
   reaching its limits.

2) What are the limits to single-processor performance?
   (for now, define a "single processor" as one program counter)

   This raises the question: how do we improve the performance of
   uniprocessors?
   - pipelining
     but eventually branch mispredictions and latch overheads kill you
   - multi-issue machines
     but eventually you run out of parallel work to do within a small
     instruction window, limited by branch mispredictions and cache misses
   More fundamentally: there are limits to fine-grained instruction-level
   parallelism.

3) The reasons for parallel processing have changed.
   Traditionally:
   - you want more performance than you can get from a single-chip
     processor, e.g. a cluster of PCs
   Now:
   - you want maximum performance from *one* chip
   - parallelism is the most effective way to get it

4) Some kinds of multiprocessing:
   - Diagram of SIMD machine (shared PC and instruction decode)
   - Diagram of MIMD machine (PC and decode for each processor)
   - Contrast with ILP (one PC, but each instruction executes on only one
     piece of data, unlike a SIMD machine)

   [Use Niagara 2 slides from Hot Chips 2006 to illustrate a multi-core,
   *multi-threaded* machine]

   Draw figure showing the MIPS 5-stage pipeline, but with 4 program
   counters feeding the pipeline, and four register files.

5) Examples of uses:
   - ATM transaction-processing system
   - Web server
   - Graphics processors
   - OS + application #1 + application #2

6) Writing parallel programs can be hard...
   - static-page web serving -- easy
   - parallelizing the final project assignment (cache simulator) -- hard

7) Granularity of parallelism:

   Granularity         Example
   -----------         -------
   bit                 32-bit adder
   instruction (ILP)   superscalar or out-of-order processor
   thread              data parallelism -- update the color of every pixel
   task                MS Word + web browser running on a two-processor machine

8) Limits to parallelism:
   - Amdahl's Law... What percentage of the program is parallelizable?
     What if I perfectly parallelized that? If a fraction p of the work is
     parallelizable across n processors, speedup = 1 / ((1 - p) + p/n);
     even as n -> infinity, speedup is capped at 1 / (1 - p).

9) Communication models

   Asked question: how do two programs communicate?

   Message passing:
   - e.g. network sockets
   - e.g. a special API like MPI

   Shared memory:
   - Showed how page tables could implement this in the operating system,
     allowing two processes to share memory
   - Briefly mentioned issues with cache coherence
   - non-uniform memory access (NUMA) vs. uniform memory access (UMA)

Example of a parallel program
-----------------------------
Sum of an array of numbers, pg. 9-11 from CD
. show a naive implementation (lots of false sharing)
. show a better implementation
. show graphically how it works
(sketches of both versions appear after the synchronization list below)

Some synchronization models
---------------------------
* Barrier -- everyone waits until all arrive, then everyone continues
* Lock -- exclusive access (mutual exclusion)
* Producer/consumer -- pairwise ordering
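A minimal sketch of all three models using POSIX threads (my own pairing of
primitives, not from the lecture; compile with -pthread): a mutex for the
lock, pthread_barrier_wait for the barrier, and a semaphore for the
producer/consumer ordering.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    #define NTHREADS 4

    pthread_barrier_t barrier;                         /* barrier */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* lock */
    sem_t full;                                        /* producer/consumer */

    int counter = 0;   /* protected by lock */
    int mailbox;       /* passed producer -> consumer */

    void *worker(void *arg)
    {
        /* Lock: mutual exclusion around the shared counter. */
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);

        /* Barrier: nobody proceeds until all NTHREADS threads arrive. */
        pthread_barrier_wait(&barrier);

        /* Producer/consumer: thread 0 produces, thread 1 consumes. */
        if ((long)arg == 0) { mailbox = 42; sem_post(&full); }
        if ((long)arg == 1) { sem_wait(&full); printf("got %d\n", mailbox); }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);
        sem_init(&full, 0, 0);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %d\n", counter);   /* always NTHREADS */
        return 0;
    }

Note the pairwise ordering: sem_wait blocks until sem_post has happened, so
the consumer is guaranteed to see the value the producer wrote.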
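And a sketch of the array-sum example from the section above, in the spirit
of pg. 9-11 but with my own names and sizes (4 threads, 64-byte cache lines
assumed). The naive version keeps per-thread partial sums in adjacent words,
so they all live in one cache line and every update invalidates the other
processors' copies: false sharing. The better version accumulates in a local
variable and pads each slot out to a full line.

    #include <pthread.h>
    #include <stdio.h>

    #define N        4096
    #define NTHREADS 4

    int array[N];

    /* Naive: adjacent partial sums share a cache line -- false sharing. */
    long partial_naive[NTHREADS];

    /* Better: pad each slot to its own (assumed 64-byte) cache line. */
    struct { long sum; char pad[64 - sizeof(long)]; } partial_padded[NTHREADS];

    void *sum_naive(void *arg)
    {
        long id = (long)arg;
        for (int i = id; i < N; i += NTHREADS)
            partial_naive[id] += array[i];  /* ping-pongs the shared line */
        return NULL;
    }

    void *sum_padded(void *arg)
    {
        long id = (long)arg;
        long local = 0;                     /* accumulate in a register... */
        for (int i = id; i < N; i += NTHREADS)
            local += array[i];
        partial_padded[id].sum = local;     /* ...and write memory once */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        long total = 0;
        for (int i = 0; i < N; i++) array[i] = 1;
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, sum_padded, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);       /* join acts as a barrier here */
        for (int i = 0; i < NTHREADS; i++)
            total += partial_padded[i].sum;
        printf("total = %ld\n", total);     /* prints 4096 */
        return 0;
    }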
Bus-based cache coherence
-------------------------
* Figure of a bus-based multiprocessor
* Talk about what happens if writes happen in multiple caches -- bad!
* Use the bus to coordinate writes
* Two options:
  - write-update -- all writes go to the bus
  - write-invalidate -- the writing processor must acquire an exclusive
    copy prior to writing
    Conserves bus bandwidth, just like a write-allocate cache
* MESI protocol:
    M = modified
    E = exclusive
    S = shared
    I = invalid
* Go through an example of one cache line and its state in each cache:
  - first it's written by one processor
    (that cache holds the only copy: Modified; everyone else: Invalid)
  - then it's read by lots of processors
    (all copies end up Shared)
  - then it's written by a different processor
    (that copy becomes Modified; all others are Invalidated)
  (a toy state trace appears at the end of these notes)
* Multiprocessors also need support for atomic memory operations:
  e.g. test-and-set -- test if 0 and, if so, atomically set it to 1.

Synchronization
---------------
Atomic pair, pg. 9-20 from CD:
. Load locked
. Store conditional
(the store conditional succeeds only if no other processor wrote the
location since the load locked; on failure, retry the pair)
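A sketch of how the atomic pair gets used in practice, written in C with
GCC's __sync builtins rather than MIPS assembly (on MIPS the builtin
compiles down to an ll/sc retry loop like the one on pg. 9-20; the variable
names are mine): a spinlock built from atomic test-and-set.

    #include <pthread.h>
    #include <stdio.h>

    static volatile int lock = 0;   /* 0 = free, 1 = held */
    static long counter = 0;

    static void spin_lock(volatile int *l)
    {
        /* Atomically write 1 and get the old value back.  If the old
           value was 1, someone else holds the lock -- spin and retry. */
        while (__sync_lock_test_and_set(l, 1))
            ;  /* spin */
    }

    static void spin_unlock(volatile int *l)
    {
        __sync_lock_release(l);     /* atomically write 0 */
    }

    static void *worker(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            spin_lock(&lock);
            counter++;              /* critical section */
            spin_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("%ld\n", counter);   /* 200000 -- no lost updates */
        return 0;
    }

Without the atomic test-and-set, two processors could both read the lock as
0 and both enter the critical section; the ll/sc pair is exactly what closes
that window.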
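Finally, returning to the MESI walkthrough above: a toy state trace (my own
simplified model, ignoring writebacks and data transfer) that prints the
state of one line in each of four caches as the lecture's access sequence
happens under a write-invalidate protocol.

    #include <stdio.h>

    #define NCACHES 4
    typedef enum { M, E, S, I } mesi_t;
    static const char *name[] = { "M", "E", "S", "I" };
    static mesi_t line[NCACHES];

    /* Write by cache w: writer goes to M, every other copy invalidated. */
    static void write_by(int w)
    {
        for (int c = 0; c < NCACHES; c++)
            line[c] = (c == w) ? M : I;
    }

    /* Read by cache r: any valid copy elsewhere downgrades to S;
       r gets S if there are sharers, E if it is alone. */
    static void read_by(int r)
    {
        int sharers = 0;
        for (int c = 0; c < NCACHES; c++)
            if (c != r && line[c] != I) { line[c] = S; sharers++; }
        line[r] = sharers ? S : (line[r] == I ? E : line[r]);
    }

    static void show(const char *ev)
    {
        printf("%-12s", ev);
        for (int c = 0; c < NCACHES; c++) printf(" %s", name[line[c]]);
        printf("\n");
    }

    int main(void)
    {
        for (int c = 0; c < NCACHES; c++) line[c] = I;
        show("initial");                  /* I I I I */
        write_by(0);  show("P0 writes");  /* M I I I */
        read_by(1);   show("P1 reads");   /* S S I I */
        read_by(2);   show("P2 reads");   /* S S S I */
        write_by(3);  show("P3 writes");  /* I I I M */
        return 0;
    }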