## Lecture 14: Caching

- Last time:
  - Branch prediction
  - Issuing multiple instructions in each cycle
- Today:
  - What part of the pipeline have we been glossing over?
    - Memory!!
  - Very important to overall machine performance.

UTCS CS352, S05 Lecture 14


## Memory System Overview

- Memory Hierarchies
  - Latency/Bandwidth/Locality
  - Caches
    - Principles: why do they work?
    - Cache organization
    - Cache performance
    - Types of misses (the 3 Cs)
  - Main memory organization
    - DRAM vs. SRAM
    - Bank organization
    - Tracking multiple references
  - Trends in memory system design

- Logical Organization
  - Name spaces
  - Protection and sharing
  - Resource management
    - virtual memory, paging, and swapping
  - Segmentation
  - Capability-based addressing


## The Memory Bottleneck

- Typical CPU clock rate
  - 3 GHz (0.33 ns cycle time)
- Typical DRAM access time
  - 30ns (about 90 cycles)
- Typical main memory access
  - 70ns (210 cycles)
    - DRAM (30), precharge (10), chip crossings (15), overhead (15).
- Our pipeline designs assume 1 cycle access
- Average instruction references
  - 1 instruction word
  - 0.3 data words

- This problem gets worse
  - CPUs get faster
  - Memories get bigger
- Memory delay is mostly communication time
  - reading/writing a bit is fast
  - it takes time to
    - select the right bit
    - route the data to/from the bit
- Big memories are slow
- Small memories can be made fast
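
The cycle counts on this slide follow directly from the clock rate. A minimal sketch of the arithmetic, assuming (hypothetically) that there is no cache and every reference pays the full main memory latency; the names below are illustrative and not part of the lecture:

```python
CLOCK_GHZ = 3.0             # typical CPU clock rate from the slide
MAIN_MEMORY_NS = 70.0       # DRAM 30 + precharge 10 + chip crossings 15 + overhead 15
REFS_PER_INSTRUCTION = 1.3  # 1 instruction word + 0.3 data words


def ns_to_cycles(latency_ns, clock_ghz=CLOCK_GHZ):
    """Convert a latency in nanoseconds to CPU cycles (GHz == cycles per ns)."""
    return latency_ns * clock_ghz


dram_cycles = ns_to_cycles(30.0)
main_memory_cycles = ns_to_cycles(MAIN_MEMORY_NS)
# Hypothetical worst case: no cache, so every reference pays the full latency.
memory_cycles_per_instruction = REFS_PER_INSTRUCTION * main_memory_cycles

print(round(dram_cycles), round(main_memory_cycles), round(memory_cycles_per_instruction))
# prints: 90 210 273
```

Under that assumption, memory alone would cost roughly 273 cycles per instruction, against the 1 cycle access that the pipeline designs assume.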

## Program Behavior

- Locality depends on type of program
- Some programs 'behave' well
  - small loop operating on data on stack
- Some programs don't
  - frequent calls to nearly random subroutines
  - traversal of large, sparse data set
    - essentially random data references with no reuse
- Most programs exhibit some degree of locality
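
One way to see the difference between well-behaved and badly behaved programs is to replay two address traces through a small fast memory and compare hit rates. The sketch below is only an illustration: the direct-mapped model, the 8-block by 16-byte size, and both traces are assumptions chosen for the example, not measurements.

```python
import random


def hit_rate(addresses, num_blocks=8, block_bytes=16):
    """Hit rate of a small direct-mapped cache (an assumed, illustrative model)."""
    stored = [None] * num_blocks          # block address held in each slot
    hits = 0
    for addr in addresses:
        block = addr // block_bytes
        slot = block % num_blocks
        if stored[slot] == block:
            hits += 1
        else:
            stored[slot] = block          # miss: bring the block in
    return hits / len(addresses)


# Well-behaved: a small loop sweeping a 64-byte array 100 times, one 4-byte word at a time.
loop_trace = [base for _ in range(100) for base in range(0, 64, 4)]

# Badly behaved: essentially random word references over a 1 MB region, with little reuse.
rng = random.Random(0)
random_trace = [rng.randrange(0, 1 << 20, 4) for _ in range(len(loop_trace))]

print(f"loop:   {hit_rate(loop_trace):.4f}")
print(f"random: {hit_rate(random_trace):.4f}")
```

The loop misses only on its first touch of each of its 4 blocks, while the random trace almost never finds its data already present.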

## Two kinds of "fast & small" memory

- Programmer manages it manually
  - Sometimes called a "scratchpad" memory
  - CELL processor uses this approach
- Hardware manages it automatically
  - Invisible to programmer
  - Referred to as a "cache"
  - Most CPUs use this approach
  - Easy for programmers; Hard for hardware


## How does hardware keep track of what's in the fast memory (cache)?

- How does it know what's in the cache 'now'?
- How does it decide what to add to the cache?
- How does it decide what to remove from the cache?
- How does it keep the cache consistent with the off-chip memory?


## Cache Organization



- Where does a block get placed?
- How do we find it?
- Which one do we replace when a new one is brought in?
- What happens on a write?

## Taking advantage of Spatial Locality

- Instead of each block in cache being just 1 word, what if we made it 4 words?
- When we get our 1 word instruction or 1 word of data from memory to put in the cache, get the next 3 as well, because they are likely to be used soon!
- We then need a way to choose which of the 4 words in the block we want when we access the cache; this field of the address is called the block offset.
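
The effect of a larger block can be sketched by counting misses for a run of consecutive word addresses. This is an illustration under assumptions: a direct-mapped cache holding 8 blocks, 4-byte words, and a made-up sequential trace; `count_misses` and `block_offset` are hypothetical helper names.

```python
WORD_BYTES = 4


def count_misses(addresses, words_per_block, num_blocks=8):
    """Count misses in a direct-mapped cache with the given block size."""
    block_bytes = words_per_block * WORD_BYTES
    stored = [None] * num_blocks
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        slot = block % num_blocks
        if stored[slot] != block:
            misses += 1
            stored[slot] = block   # fetch the whole block, neighbours included
    return misses


def block_offset(addr, words_per_block):
    """Which word within its block a byte address selects."""
    return (addr // WORD_BYTES) % words_per_block


# 64 consecutive instruction words, as in straight-line code.
sequential = [i * WORD_BYTES for i in range(64)]

print(count_misses(sequential, words_per_block=1))   # 64: every word misses
print(count_misses(sequential, words_per_block=4))   # 16: one miss per 4-word block
print([block_offset(a, 4) for a in sequential[:6]])  # [0, 1, 2, 3, 0, 1]
```

With 4-word blocks, the first word of each block misses and the next 3 hit because they were fetched along with it.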


## How Do We Find a Block in The Cache?

- Our Example:
  - Main memory address space = 32 bits (= 4GBytes)
  - Block size = 4 words = 16 bytes
  - Cache capacity = 8 blocks = 128 bytes



- Valid bit ⇒ does the cache block hold valid data?
- index ⇒ which set
- tag ⇒ which memory block's data/instructions the cache block holds
- block offset ⇒ which word in block
- The number of tag/index bits determines the associativity
- tag/index bits can come from anywhere in block address
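
For the example cache (32-bit addresses, 16-byte blocks, 8 blocks), the field widths can be worked out for each associativity. The sketch below assumes the common layout in which the index is taken from the low bits of the block address and the tag from the remaining high bits, and that the block size and number of sets are powers of 2; the function names are illustrative.

```python
ADDRESS_BITS = 32
BLOCK_BYTES = 16   # 4 words of 4 bytes
NUM_BLOCKS = 8
WORD_BYTES = 4


def field_widths(associativity):
    """Return (tag_bits, index_bits, offset_bits) for the example cache."""
    num_sets = NUM_BLOCKS // associativity
    offset_bits = BLOCK_BYTES.bit_length() - 1   # log2(16) = 4
    index_bits = num_sets.bit_length() - 1       # log2(number of sets)
    tag_bits = ADDRESS_BITS - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


def split_address(addr, associativity):
    """Split a byte address into (tag, set index, word within block)."""
    _, index_bits, offset_bits = field_widths(associativity)
    word = (addr & (BLOCK_BYTES - 1)) // WORD_BYTES
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, word


print(field_widths(1))          # direct mapped:     (25, 3, 4)
print(field_widths(2))          # 2-way:             (26, 2, 4)
print(field_widths(8))          # fully associative: (28, 0, 4)
print(split_address(0x1234, 1)) # (36, 3, 1)
```

Each doubling of the associativity halves the number of sets, so one bit moves from the index to the tag.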

## Set Associative Cache

- S sets
- A elements in each set
  - A-way associative
- In the example, S=4, A=2
  - 2-way associative 8-entry cache
- All of main memory is divided into S sets
  - All addresses in set N map to same set of the cache
    - N = Addr mod S, where Addr is the block address
    - A locations available
- Shares costly comparators across sets

- Low address bits select set
  - 2 bits in the example (S = 4)
- High address bits are tag, used to associatively search the selected set
- Extreme cases
  - A=1: Direct mapped cache
  - S=1: Fully associative
- A need not be a power of 2
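
The lookup described above can be sketched as a small simulator with S sets of A blocks each. Two things here are assumptions rather than anything stated on the slide: LRU replacement within a set, and the class and method names, which are purely illustrative.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """S sets of A blocks each; LRU replacement is an assumption of this sketch."""

    def __init__(self, num_sets=4, ways=2, block_bytes=16):
        self.num_sets = num_sets
        self.ways = ways
        self.block_bytes = block_bytes
        # One OrderedDict per set, mapping tag -> True, least recently used first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        """Return True on a hit, False on a miss (the block is then installed)."""
        block = addr // self.block_bytes
        index = block % self.num_sets      # low block-address bits select the set
        tag = block // self.num_sets       # remaining high bits are the tag
        ways = self.sets[index]
        if tag in ways:                    # associative search of the selected set
            ways.move_to_end(tag)          # mark as most recently used
            return True
        if len(ways) >= self.ways:
            ways.popitem(last=False)       # evict the least recently used block
        ways[tag] = True
        return False


# Blocks 0 and 8 collide in set 0 of both 8-block organizations.
trace = [0x000, 0x080, 0x000, 0x080]
two_way = SetAssociativeCache(num_sets=4, ways=2)
direct = SetAssociativeCache(num_sets=8, ways=1)
print([two_way.access(a) for a in trace])  # [False, False, True, True]
print([direct.access(a) for a in trace])   # [False, False, False, False]
```

Both caches hold 8 blocks, but the direct mapped one (A=1) keeps evicting one block to make room for the other, while the 2-way cache can hold both in the same set.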

## Questions to think about

- As the block size goes up, what happens to the miss rate?
  - ... what happens to the miss penalty?
  - ... what happens to hit time?
- As the associativity goes up, what happens to the miss rate?
  - ... what happens to the hit time?


## Next time

- More on caches
  - How to analyze and improve their performance
  - No new reading
- Homework #4 due
