Lecture 13: Branch prediction, static & dynamic multiple issue, OOO
===================================================================

Branch prediction
=================

* Simplest strategy:
  - predict taken
  - if the prediction was wrong, invalidate the instructions fetched
    after the branch
  - some invalidation has to happen anyway, even if nothing else is done

* Static branch prediction
  * Look at simple code:


     for (i=0; i<10; i++) {
       foo;
     }

     loop:  subi R2, R3, #10
            bgez R2, done
            foo
            addi R3, R3, #1
            j loop
     done:


  - hint bit in instruction
  - or, assume backward branches are taken (they usually close loops)
    and forward branches are not
  - predict accordingly
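
The two static rules above can be sketched in C.  This is only an
illustration: the function names and the position of the hint bit are
made up for the example, and no particular instruction set is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Rule 1 (hypothetical encoding): the compiler sets one bit of the
 * instruction word to say "this branch is usually taken". */
bool predict_from_hint(uint32_t instr_word, unsigned hint_bit)
{
    assert(hint_bit < 32);
    return (instr_word >> hint_bit) & 1u;
}

/* Rule 2: backward branches (target at or below the branch's own
 * address) are predicted taken, forward branches not taken. */
bool predict_from_direction(uint32_t branch_pc, uint32_t target_pc)
{
    return target_pc <= branch_pc;
}
```

In the loop above, bgez branches forward to done, so rule 2 predicts it
not taken, which is correct every time except the final exit.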

* Dynamic branch prediction
   * Look at simple code:

      if (flag) {

      } else {

      }

      In assembly this looks like:

            BEQ R1, R2, else
            do_something;
            j skip
      else: do_something2;
      skip: ...

      What can tell us more about how this branch is taken?
      What if flag is set at the beginning of the program?

   a) Use history of this branch
       - last outcome only (1 bit) -- leads to double-mispredict on
         every run of a loop
       - last two outcomes (2 bits) -- leads to better behavior
       - table indexed by low bits of program counter
       * "Branch History Buffer"

   b) Use history of this branch, *and* previous branch!
       - table indexed by low bits of program counter, plus result
         of previous branch
       * "Correlating predictor"
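
To make the 1-bit versus 2-bit difference concrete, here is a small C
sketch.  It assumes 4-byte instructions, a tiny 16-entry table, a
saturating-counter encoding for the 2-bit case, and a loop-closing branch
that is taken on every iteration but the last.  All names are invented
for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { INDEX_BITS = 4, TABLE_SIZE = 1 << INDEX_BITS };

/* (a) Table index from the low PC bits.  The two byte-offset bits are
 * dropped, assuming 4-byte instructions. */
unsigned index_by_pc(uint32_t pc)
{
    return (pc >> 2) & (TABLE_SIZE - 1);
}

/* (b) Same, but the outcome of the previous branch is appended as the
 * lowest index bit, so a branch gets one entry per previous outcome. */
unsigned index_by_pc_and_prev(uint32_t pc, bool prev_taken)
{
    return (((pc >> 2) << 1) | (prev_taken ? 1u : 0u)) & (TABLE_SIZE - 1);
}

/* One table entry.  bits == 1: remember the last outcome only.
 * bits == 2: saturating counter 0..3, predict taken when >= 2. */
bool predict_entry(int bits, uint8_t state)
{
    return bits == 1 ? state != 0 : state >= 2;
}

uint8_t update_entry(int bits, uint8_t state, bool taken)
{
    if (bits == 1)
        return taken ? 1 : 0;
    if (taken)
        return state < 3 ? (uint8_t)(state + 1) : 3;
    return state > 0 ? (uint8_t)(state - 1) : 0;
}

/* Mispredictions for a loop-closing branch that is taken trips-1 times
 * and then falls through, with the whole loop executed `runs` times.
 * The entry starts at 0 (predict not taken). */
int count_mispredicts(int bits, int trips, int runs)
{
    assert(bits == 1 || bits == 2);
    uint8_t state = 0;
    int misses = 0;
    for (int r = 0; r < runs; r++) {
        for (int i = 0; i < trips; i++) {
            bool taken = i < trips - 1;
            if (predict_entry(bits, state) != taken)
                misses++;
            state = update_entry(bits, state, taken);
        }
    }
    return misses;
}
```

With trips = 10 and runs = 100, the 1-bit entry mispredicts 200 times
(twice per run), while the 2-bit counter mispredicts 102 times (three
times in the first run while it warms up, then once per run at the exit).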

* What about destination address?
  We need it immediately.
  But computing it takes two cycles:  PC -> Ifetch/Decode -> Add PC to Offset
  Instead:
    Use low bits of PC as index into table (a "Branch Target Buffer").
    Fetch destination from table.
    This is just:
       PC -> Table
       one less step; no need to wait for decode
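
A minimal sketch of such a table in C follows.  The 64-entry size, the
4-byte instruction assumption, and the stored tag (used here to notice
when two branches share an index) are choices made for the example, not
details from the lecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { BTB_BITS = 6, BTB_SIZE = 1 << BTB_BITS };

typedef struct {
    bool     valid;
    uint32_t tag;      /* PC of the branch that owns this entry */
    uint32_t target;   /* destination recorded the last time it ran */
} BtbEntry;

typedef struct {
    BtbEntry entry[BTB_SIZE];
} Btb;

/* Index with the low PC bits (byte-offset bits dropped). */
unsigned btb_index(uint32_t pc)
{
    return (pc >> 2) & (BTB_SIZE - 1);
}

/* Needs only the PC, so it can run alongside instruction fetch.
 * Returns true and fills *target on a hit. */
bool btb_lookup(const Btb *b, uint32_t pc, uint32_t *target)
{
    assert(b && target);
    const BtbEntry *e = &b->entry[btb_index(pc)];
    if (e->valid && e->tag == pc) {
        *target = e->target;
        return true;
    }
    return false;
}

/* Called once the real destination is known (after decode/execute). */
void btb_update(Btb *b, uint32_t pc, uint32_t target)
{
    assert(b);
    BtbEntry *e = &b->entry[btb_index(pc)];
    e->valid  = true;
    e->tag    = pc;
    e->target = target;
}
```

The first time a branch is seen the lookup misses; after btb_update has
recorded the destination, later fetches of the same PC get it straight
from the table.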

Multiple Issue
==============
[Superscalar, VLIW]

Use slides.

OOO, Register renaming
======================
Example of out-of-order execution and register renaming
=======================================================

This lecture was given on the chalkboard, so you had to "be there"
to get the full benefit.

First, I described the difference between a superscalar processor
(multiple execution units) and an out-of-order processor.  Out-of-order
execution is particularly important because it lets the processor
do useful work while a load instruction waits on a cache miss.

Then, we found RAW, WAW, and WAR dependencies in the following code:

L.D    F6, 34(R2)
L.D    F2, 45(R3)
MULT.D F0, F2, F4
SUB.D  F8, F6, F2
DIV.D  F10, F0, F6
ADD.D  F6, F8, F2
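
The dependency hunt can also be done mechanically.  The C sketch below
encodes the six instructions and checks every earlier/later pair.  The
register numbering and function names are invented for the example, and
the pairwise test ignores intervening writes, which happens not to matter
for this code.

```c
#include <assert.h>
#include <stdbool.h>

/* Register numbering used only in this sketch: Fn is n, Rn is 100 + n. */
enum { NONE = -1, R_BASE = 100, NUM_INSTRS = 6 };

typedef struct {
    const char *text;
    int dst, src1, src2;
} Instr;

const Instr prog[NUM_INSTRS] = {
    { "L.D    F6, 34(R2)",   6, R_BASE + 2, NONE },
    { "L.D    F2, 45(R3)",   2, R_BASE + 3, NONE },
    { "MULT.D F0, F2, F4",   0, 2, 4 },
    { "SUB.D  F8, F6, F2",   8, 6, 2 },
    { "DIV.D  F10, F0, F6", 10, 0, 6 },
    { "ADD.D  F6, F8, F2",   6, 8, 2 },
};

bool reads_reg(const Instr *in, int reg)
{
    return reg != NONE && (in->src1 == reg || in->src2 == reg);
}

/* In all three tests, instruction i comes before instruction j. */

/* RAW: j reads the register that i writes. */
bool is_raw(int i, int j)
{
    assert(i < j);
    return reads_reg(&prog[j], prog[i].dst);
}

/* WAW: i and j write the same register. */
bool is_waw(int i, int j)
{
    assert(i < j);
    return prog[i].dst != NONE && prog[i].dst == prog[j].dst;
}

/* WAR: j overwrites a register that i still needs to read. */
bool is_war(int i, int j)
{
    assert(i < j);
    return reads_reg(&prog[i], prog[j].dst);
}

/* Count the ordered pairs (i before j) for which `dep` holds. */
int count_deps(bool (*dep)(int, int))
{
    int n = 0;
    for (int i = 0; i < NUM_INSTRS; i++)
        for (int j = i + 1; j < NUM_INSTRS; j++)
            if (dep(i, j))
                n++;
    return n;
}
```

On this code the sketch finds 7 RAW pairs, 1 WAW pair (the two writes to
F6), and 2 WAR pairs (SUB.D and DIV.D read F6 before ADD.D overwrites it).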

Next, we renamed all of the architectural registers (Fn, Rn) in this
code to physical registers (Pn).  We did this by moving forward through
the code and updating the table that maps architectural registers to
physical registers.
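
The renaming pass can be sketched in C as below.  Several things here are
assumptions for the sake of a short example: each architectural register
starts out mapped to the physical register with the same index, new
physical registers are handed out by a simple counter rather than a real
free list, and the register numbering is invented.

```c
#include <assert.h>

/* Numbering used only in this sketch: F0..F31 are architectural
 * registers 0..31 and R0..R31 are 32..63. */
#define F(n) (n)
#define R(n) (32 + (n))
enum { NONE = -1, NUM_ARCH = 64, NUM_PHYS = 128, NUM_INSTRS = 6 };

typedef struct { int dst, src1, src2; } Instr;

typedef struct {
    int map[NUM_ARCH];   /* architectural -> physical */
    int next_free;       /* next unused physical register */
} RenameTable;

const Instr program[NUM_INSTRS] = {
    { F(6),  R(2), NONE },   /* L.D    F6, 34(R2)  */
    { F(2),  R(3), NONE },   /* L.D    F2, 45(R3)  */
    { F(0),  F(2), F(4) },   /* MULT.D F0, F2, F4  */
    { F(8),  F(6), F(2) },   /* SUB.D  F8, F6, F2  */
    { F(10), F(0), F(6) },   /* DIV.D  F10, F0, F6 */
    { F(6),  F(8), F(2) },   /* ADD.D  F6, F8, F2  */
};

void rename_init(RenameTable *t)
{
    for (int i = 0; i < NUM_ARCH; i++)
        t->map[i] = i;           /* assumed starting state */
    t->next_free = NUM_ARCH;
}

Instr rename_one(RenameTable *t, Instr in)
{
    Instr out;
    /* Sources use the mapping as it stands before this instruction. */
    out.src1 = in.src1 == NONE ? NONE : t->map[in.src1];
    out.src2 = in.src2 == NONE ? NONE : t->map[in.src2];
    /* The destination always gets a fresh physical register. */
    assert(t->next_free < NUM_PHYS);
    out.dst = t->next_free++;
    t->map[in.dst] = out.dst;
    return out;
}

/* Move forward through the code, updating the table as we go. */
void rename_all(const Instr *in, Instr *out, int n)
{
    RenameTable t;
    rename_init(&t);
    for (int i = 0; i < n; i++)
        out[i] = rename_one(&t, in[i]);
}
```

Under these assumptions the first L.D writes P64 and the final ADD.D
writes P69, so the two writes to F6 no longer collide, while SUB.D and
DIV.D still read the loaded value from P64.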

I also explained the general organization of an out-of-order processor,
with a better version of the following figure:


  IN ORDER              Instruction Fetch
  PROCESSING                  |
  (architectural
   registers)                 |
  ...........................\|/...................
                         . Reservation Stations
  OUT-OF-ORDER           . Execution units (ALUs)
  PROCESSING             . Reorder Buffer
  (physical registers)
  .................................................
                           Commit
  IN ORDER                   |
  PROCESSING                \ /
  (architectural
   registers)
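
The bottom half of the figure (finish out of order, commit in order) can
be illustrated with a toy reorder buffer in C.  This is a rough sketch
under simplifying assumptions: an 8-entry circular buffer, entries that
only track a "done" flag, and invented function names.  Real designs keep
much more state per entry.

```c
#include <assert.h>
#include <stdbool.h>

enum { ROB_SIZE = 8 };

typedef struct {
    bool done[ROB_SIZE];   /* has this instruction finished executing? */
    int  head;             /* oldest instruction not yet committed */
    int  tail;             /* where the next instruction will go */
    int  count;            /* entries currently in use */
} Rob;

void rob_init(Rob *r)
{
    r->head = r->tail = r->count = 0;
    for (int i = 0; i < ROB_SIZE; i++)
        r->done[i] = false;
}

/* In-order front end: allocate the next slot in program order. */
int rob_issue(Rob *r)
{
    assert(r->count < ROB_SIZE);
    int slot = r->tail;
    r->done[slot] = false;
    r->tail = (r->tail + 1) % ROB_SIZE;
    r->count++;
    return slot;
}

/* Out-of-order middle: any slot may finish at any time. */
void rob_complete(Rob *r, int slot)
{
    r->done[slot] = true;
}

/* In-order back end: commit from the head only while the oldest
 * instruction is done.  Returns how many instructions committed. */
int rob_commit(Rob *r)
{
    int committed = 0;
    while (r->count > 0 && r->done[r->head]) {
        r->head = (r->head + 1) % ROB_SIZE;
        r->count--;
        committed++;
    }
    return committed;
}
```

If three instructions are issued and the third finishes first, nothing
commits until the first one is done; once the second also finishes, it
and the third commit together.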

Finally, I noted that I was glossing over some details of real out-of-order
processors, but that I primarily wanted to make sure you understood the
following concepts:
   . how dependencies between instructions restrict which instructions
     can be executed out-of-order
   . how register renaming is used to eliminate false (WAW/WAR) dependencies
   . the high-level organization of an out-of-order processor