
# Measuring and improving cache performance

## Cache Performance

• Average memory access time is a useful measure to evaluate the performance of a memory-hierarchy configuration.

• It tells us how much penalty the memory system imposes on each access (on average).

• It can easily be converted into clock cycles for a particular CPU.

• But leaving the penalty in nanoseconds allows two systems with different clock cycle times to be compared against the same memory system.
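
The point above can be sketched with a one-line helper; the hit time, miss rate, and miss penalty below are made-up illustrative numbers, not values from the text:

```python
def amat_ns(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time, kept in nanoseconds.

    Leaving the result in ns (instead of converting to cycles) lets
    machines with different clock rates be compared against the same
    memory system.
    """
    return hit_time_ns + miss_rate * miss_penalty_ns

# Illustrative numbers: 1 ns hit time, 2% miss rate, 100 ns miss penalty.
print(amat_ns(1.0, 0.02, 100.0))  # 3.0 ns on average per access
```

To convert to cycles for a particular CPU, divide the result by that CPU's clock cycle time.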

## Cache Performance

• There may be different penalties for Instruction and Data accesses.
• In this case, you may have to compute them separately.

• This requires knowledge of the fraction of references that are instructions and the fraction that are data.
• The text gives 75% instruction references to 25% data references.

• We can also compute the write penalty separately from the read penalty.

• This may be necessary for two reasons:
• Miss rates are different for each situation.
• Miss penalties are different for each situation.

• Treating them as a single quantity yields a useful CPU time formula:
• CPU time = IC × (CPI_execution + Memory stall cycles per instruction) × Clock cycle time

## An Example

• Compare the performance of a 64KB unified cache with a split cache consisting of a 32KB data cache and a 16KB instruction cache.

• The miss penalty for either cache is 100 ns, and the CPU clock runs at 200 MHz.

• Don’t forget that a unified cache requires an extra cycle for load and store hits because of the structural conflict between instruction and data accesses.

• Calculate the effect on CPI rather than the average memory access time.

• Assume miss rates are as follows (Fig. 5.7 in text):
• 64K Unified cache: 1.35%
• 16K instruction cache: 0.64%
• 32K data cache: 4.82%

• Assume a data access occurs once for every 3 instructions, on average.

## An Example

• The solution is to figure out the penalty to CPI separately for instructions and data.

• First, we figure out the miss penalty in terms of clock cycles: 100 ns/5 ns = 20 cycles.
• For the unified cache, the per-instruction penalty is (0 + 1.35% x 20) = 0.27 cycles.
• For data accesses, which occur on about 1/3 of all instructions, the penalty is (1 + 1.35% x 20) = 1.27 cycles per access, or 0.42 cycles per instruction.
• The total penalty is 0.69 CPI.

• In the split cache, the per-instruction penalty is (0 + 0.64% x 20) = 0.13 CPI.
• For data accesses, it is (0 + 4.82% x 20) x (1/3) = 0.32 CPI.
• The total penalty is 0.45 CPI.

• In this case, the split cache performs better, largely because it avoids the structural-hazard stall on data accesses.
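
The arithmetic above can be checked directly. All numbers come from the example itself (5 ns cycle at 200 MHz, 100 ns miss penalty, one data access per three instructions):

```python
MISS_PENALTY = 100e-9 / 5e-9   # 100 ns / 5 ns cycle = 20 cycles
DATA_PER_INSTR = 1 / 3         # one data access per 3 instructions

def memory_stall_cpi(i_miss_rate, d_miss_rate, data_hit_stall):
    """Memory-stall CPI = instruction-fetch penalty + data-access penalty.

    data_hit_stall is the extra cycle paid on load/store hits in a
    unified cache (structural conflict); it is 0 for a split cache.
    """
    instr_penalty = i_miss_rate * MISS_PENALTY
    data_penalty = (data_hit_stall + d_miss_rate * MISS_PENALTY) * DATA_PER_INSTR
    return instr_penalty + data_penalty

unified = memory_stall_cpi(0.0135, 0.0135, data_hit_stall=1)
split = memory_stall_cpi(0.0064, 0.0482, data_hit_stall=0)
print(round(unified, 2), round(split, 2))  # 0.69 0.45
```

Note that the split cache wins despite its much higher data miss rate (4.82% vs. 1.35%), because it never pays the extra hit cycle.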

## Effects of Cache Performance on CPU Performance

• Low-CPI machines suffer more from a fixed memory-stall penalty measured in CPI.
• A machine with a CPI of 5 suffers little from a 1 CPI penalty.
• However, a processor with a CPI of 0.5 has its execution time tripled!

• Cache miss penalties are measured in cycles, not nanoseconds.
• This means that a faster machine will stall more cycles on the same memory system.

• Amdahl’s Law raises its ugly head again:
• Fast machines with low CPI are affected significantly by memory access penalties.
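
The two cases above reduce to a one-line ratio:

```python
def slowdown(base_cpi, stall_cpi):
    """Execution-time ratio after adding a fixed memory-stall CPI."""
    return (base_cpi + stall_cpi) / base_cpi

print(slowdown(5.0, 1.0))  # 1.2 -> the CPI-5 machine slows by only 20%
print(slowdown(0.5, 1.0))  # 3.0 -> the CPI-0.5 machine runs 3x slower
```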

## Improving Cache Performance

• The increasing speed gap between CPU and main memory has made the performance of the memory system increasingly important.

• Fifteen distinct cache organizations characterize the efforts of system architects to reduce average memory access time.

• These organizations can be distinguished by:
• Reducing the miss rate.
• Reducing the miss penalty.
• Reducing the time to hit in a cache.

## Reducing Cache Misses

• Components of miss rate (the “three C’s”): all of these may be reduced using the methods we’ll talk about.
• Compulsory
• Cold start misses or first reference misses: The first access to a block can NOT be in the cache, so there must be a compulsory miss.
• These are suffered regardless of cache size.
• Capacity
• If the cache is too small to hold all of the blocks needed during execution of a program, misses occur on blocks that were discarded earlier.
• In other words, this is the difference between the miss rate of a finite-size fully-associative cache and the compulsory miss rate.
• Conflict
• If the cache has sufficient space for the data, but the block can NOT be kept because the set is full, a conflict miss will occur.
• This is the difference between the miss rate of a non-fully associative cache and a fully-associative cache.
• These misses are also called collision or interference misses.
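
The three definitions above suggest a simple (hypothetical) classifier: a miss is compulsory if the block was never seen before, capacity if a fully-associative LRU cache of the same total size would also miss, and conflict otherwise. This sketch uses block numbers rather than full addresses:

```python
from collections import OrderedDict

def classify_misses(trace, num_sets, assoc):
    """Classify misses of a set-associative LRU cache into the three C's."""
    seen = set()                                      # for compulsory misses
    full = OrderedDict()                              # fully-assoc. LRU reference model
    sets = [OrderedDict() for _ in range(num_sets)]   # the real cache, LRU per set
    capacity = num_sets * assoc
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0, "hit": 0}
    for block in trace:
        # Look up (and update) the real cache.
        s = sets[block % num_sets]
        hit = block in s
        if hit:
            s.move_to_end(block)
        else:
            if len(s) >= assoc:
                s.popitem(last=False)                 # evict LRU block in the set
            s[block] = True
        # Look up (and update) the fully-associative reference model.
        fa_hit = block in full
        if fa_hit:
            full.move_to_end(block)
        else:
            if len(full) >= capacity:
                full.popitem(last=False)
            full[block] = True
        # Classify.
        if hit:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1                 # first reference ever
        elif not fa_hit:
            counts["capacity"] += 1                   # even full associativity misses
        else:
            counts["conflict"] += 1                   # only the set was too full
        seen.add(block)
    return counts

# Blocks 0 and 2 collide in a direct-mapped cache with 2 sets:
print(classify_misses([0, 2, 0, 2], num_sets=2, assoc=1))
# {'compulsory': 2, 'capacity': 0, 'conflict': 2, 'hit': 0}
```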

## Reducing Cache Miss Rate

• To reduce cache miss rate, we have to eliminate some of the misses due to the three C’s.

• We cannot reduce capacity misses much except by making the cache larger.

• We can, however, reduce the conflict misses and compulsory misses in several ways:

• Larger cache blocks
• Larger blocks decrease the compulsory miss rate by taking advantage of spatial locality.

• However, they may increase the miss penalty by requiring more data to be fetched per miss.

• In addition, they will almost certainly increase conflict misses since fewer blocks can be stored in the cache.
• And maybe even capacity misses in small caches.

## Reducing Cache Miss Rate

• Larger cache blocks
• The performance curve is U-shaped because:
• Small blocks have a higher miss rate and
• Large blocks have a higher miss penalty (even if miss rate is not too high).

• High-latency, high-bandwidth memory systems encourage large block sizes, since the cache gets more bytes per miss for a small increase in miss penalty.
• 32-byte blocks are typical for 1-KB, 4-KB and 16-KB caches while 64-byte blocks are typical for larger caches.

## Reducing Cache Miss Rate

• Higher associativity
• Conflict misses can be a problem for caches with low associativity (especially direct-mapped).

• 2:1 cache rule of thumb : a direct-mapped cache of size N has the same miss rate as a 2-way set-associative cache of size N/2.

• However, there is a limit — higher associativity means more hardware and usually longer cycle times (increased hit time).
• In addition, it may cause more capacity misses.

• Nobody uses more than 8-way set-associative caches today, and most systems use 4-way or less.

• The problem is that the higher hit rate is offset by the slower clock cycle time.
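
That trade-off can be seen in average memory access time. The miss rates and hit times below are made-up illustrative numbers, not figures from the text; the point is only that the outcome flips depending on how much the cycle time stretches:

```python
def amat_ns(hit_ns, miss_rate, penalty_ns):
    """Average memory access time in nanoseconds."""
    return hit_ns + miss_rate * penalty_ns

PENALTY = 50.0  # assumed miss penalty, ns

direct       = amat_ns(1.00, 0.021, PENALTY)  # ~2.05 ns
two_way      = amat_ns(1.25, 0.014, PENALTY)  # ~1.95 ns: associativity wins
two_way_slow = amat_ns(1.40, 0.014, PENALTY)  # ~2.10 ns: slower hit time loses
print(direct, two_way, two_way_slow)
```

With a modest hit-time increase the 2-way cache wins; stretch the cycle a little more and the direct-mapped cache is faster overall.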

## Reducing Cache Miss Rate

• Victim caches
• A victim cache is a small, usually (but not necessarily) fully-associative cache that holds a few of the most recently replaced blocks, or victims, from the main cache.
• Can improve miss rates without affecting the processor clock rate.

• This cache is checked on a miss before going to main memory.
• If found, the victim block and the cache block are swapped.

## Reducing Cache Miss Rate

• Victim caches
• It can reduce capacity misses but is best at reducing conflict misses.

• It’s particularly effective for small, direct-mapped data caches.
• A 4-entry victim cache removed from 20% to 95% of the conflict misses in a 4KB direct-mapped data cache.
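
The lookup-then-swap behavior described above can be sketched as follows; this is a simplified hypothetical model that tracks block numbers only, ignoring tags, data, and timing:

```python
from collections import OrderedDict

class DirectMappedWithVictim:
    """Direct-mapped cache backed by a small fully-associative victim cache."""

    def __init__(self, num_sets, victim_entries=4):
        self.num_sets = num_sets
        self.lines = {}               # set index -> resident block
        self.victim = OrderedDict()   # recently evicted blocks, LRU order
        self.victim_entries = victim_entries

    def access(self, block):
        """Return 'hit', 'victim_hit', or 'miss'."""
        idx = block % self.num_sets
        if self.lines.get(idx) == block:
            return "hit"
        if block in self.victim:
            # Found in the victim cache: swap it with the main-cache block.
            self.victim.pop(block)
            evicted = self.lines.get(idx)
            if evicted is not None:
                self._to_victim(evicted)
            self.lines[idx] = block
            return "victim_hit"
        # True miss: fetch from memory; the displaced block becomes a victim.
        evicted = self.lines.get(idx)
        if evicted is not None:
            self._to_victim(evicted)
        self.lines[idx] = block
        return "miss"

    def _to_victim(self, block):
        if len(self.victim) >= self.victim_entries:
            self.victim.popitem(last=False)   # drop the oldest victim
        self.victim[block] = True

c = DirectMappedWithVictim(num_sets=4)
print(c.access(0), c.access(4), c.access(0))  # miss miss victim_hit
```

Blocks 0 and 4 conflict in the same set, but the victim cache turns the repeated conflict misses into cheap victim hits.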

• Pseudo-associative caches
• These caches use a technique similar to double hashing.

• On a miss, the cache searches a different set for the desired block.
• The second (pseudo) set to probe is usually found by inverting one or more bits in the original set index.

• Note that two separate searches are conducted on a miss.
• The first search proceeds as it would for a direct-mapped cache.
• Since there is no associative hardware, hit time is fast if it is found the first time.

## Reducing Cache Miss Rate

• Pseudo-associative caches
• While this second probe takes some time (usually an extra cycle or two), it is a lot faster than going to main memory.
• The secondary block can be swapped with the primary block on a “slow hit”.

• This method reduces the effect of conflict misses.

• Also improves miss rates without affecting the processor clock rate.
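
The two-probe lookup with the index-bit inversion and the "slow hit" swap can be sketched like this. It is an illustrative model where the cache simply maps a set index to a tag, and the pseudo set is found by inverting the most significant index bit:

```python
def pseudo_assoc_lookup(cache, tag, index, index_bits):
    """Probe the primary set, then a second 'pseudo' set.

    cache maps set index -> tag; index_bits is the width of the set index.
    """
    if cache.get(index) == tag:
        return "fast_hit"                    # ordinary direct-mapped hit
    alt = index ^ (1 << (index_bits - 1))    # invert the MSB of the set index
    if cache.get(alt) == tag:
        # Swap the blocks so the next access is a fast hit in the primary set.
        cache[alt] = cache.get(index)        # may be None if the set was empty
        cache[index] = tag
        return "slow_hit"                    # costs an extra cycle or two
    return "miss"

# With 8 sets (3 index bits), set 2 (0b010) pairs with set 6 (0b110).
cache = {6: "A"}
print(pseudo_assoc_lookup(cache, "A", 2, 3))  # slow_hit
print(pseudo_assoc_lookup(cache, "A", 2, 3))  # fast_hit (block was swapped)
```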

• Hardware prefetch
• Prefetching is the act of getting data from memory before it is actually needed by the CPU.

• Typically, the cache requests the next consecutive block to be fetched with a requested block.
• It is hoped that this avoids a subsequent miss.

## Reducing Cache Miss Rate

• Hardware prefetch
• This reduces compulsory misses by retrieving the data before it is requested.

• Of course, this may increase other misses by removing useful blocks from the cache.
• Thus, many caches hold prefetched blocks in a special buffer until they are actually needed.

• This buffer is faster than main memory but only has a limited capacity.

• Prefetching also uses main memory bandwidth.
• It works well if the data is actually used.
• However, it can adversely affect performance if the data is rarely used and the accesses interfere with “demand misses”.
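
A minimal sketch of next-block prefetching into a separate buffer, as described above. This is a simplified hypothetical model using block numbers and sets; it ignores bandwidth and timing:

```python
def access(cache, prefetch_buffer, block):
    """Demand access with one-block sequential prefetch.

    Prefetched blocks wait in a separate buffer so they cannot evict
    useful blocks from the cache until they are actually referenced.
    """
    if block in cache:
        return "hit"
    if block in prefetch_buffer:
        prefetch_buffer.discard(block)
        cache.add(block)                 # promote on first real use
        return "buffer_hit"
    cache.add(block)                     # demand miss: fetch the block...
    prefetch_buffer.add(block + 1)       # ...and prefetch its successor
    return "miss"

cache, buf = set(), set()
print(access(cache, buf, 0))  # miss (block 1 is prefetched)
print(access(cache, buf, 1))  # buffer_hit (the prefetch paid off)
```

If block 1 were never referenced, the prefetch would have consumed memory bandwidth for nothing, which is exactly the interference risk noted above.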