Cache Optimization on RISC-V Architecture

Overview

Course: ECE 5504 - Computer Architecture | Semester: Fall 2024

Technical Focus: Memory Hierarchy Design, Replacement Algorithms, Performance Modeling


Problem Statement & Motivation

Cache design involves fundamental trade-offs between performance and hardware complexity. Least Recently Used (LRU) replacement tracks the full recency order of each set and performs near-optimally on workloads with strong temporal locality, but it requires O(n) comparisons per access and O(n) state maintenance in an n-way associative cache. Pseudo-LRU (PLRU) reduces complexity to O(log n) with tree-based bit vectors, but at what performance cost?

This project addresses a critical question: For modern RISC-V architectures, is the 40% additional hardware complexity of LRU justified by performance improvements on real workloads?

Research Context

  • LRU Optimality: Near-optimal under idealized stack-distance models, but real caches have prefetchers, write-backs, and irregular access patterns
  • PLRU Prevalence: Used in Intel, AMD, ARM designs due to lower complexity—yet few systematic comparisons exist on RISC-V
  • Workload Specificity: Cache policy effectiveness varies dramatically by access pattern regularity
  • Simulation Fidelity: Boom Core provides realistic out-of-order execution; results more meaningful than trace-based analysis

System Architecture & Design

Simulation Platform: Berkeley Out-of-Order Machine (Boom)

Boom Core is a parameterized out-of-order RISC-V processor generator enabling precise cache behavior modeling:

  • ISA: 64-bit RISC-V RV64I
  • Microarchitecture: 3-stage pipeline, 8-way superscalar execution
  • Cache Hierarchy: Separate 32KB L1-I/D (8-way), unified 256KB L2 (8-way), 8MB L3
  • Branch Predictor: Two-bit saturating counters (global history)
  • Prefetcher: Stride-based with stream-based filtering (a generic sketch follows this list)
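
For intuition on that last bullet, here is a minimal C sketch of a PC-indexed stride prefetcher with a small confidence counter acting as the stream filter. This is a generic textbook design, not BOOM's actual prefetcher RTL; the table size, counter thresholds, and the `issue_prefetch()` hook are illustrative assumptions.

```c
#include <stdint.h>
#include <stdio.h>

/* Generic PC-indexed stride prefetcher with a confidence counter acting
 * as a stream filter: a stream must repeat the same stride twice before
 * prefetches are issued. Table size, thresholds, and issue_prefetch()
 * are illustrative assumptions, not BOOM's actual RTL. */

#define RPT_ENTRIES 64

typedef struct {
    uint64_t pc_tag;     /* load PC that owns this entry */
    uint64_t last_addr;  /* last address touched by that PC */
    int64_t  stride;     /* last observed stride */
    uint8_t  confidence; /* saturating counter, 0..3 */
} rpt_entry_t;

static rpt_entry_t rpt[RPT_ENTRIES];

static void issue_prefetch(uint64_t addr)
{
    printf("prefetch %#llx\n", (unsigned long long)addr);
}

static void prefetcher_observe(uint64_t pc, uint64_t addr)
{
    rpt_entry_t *e = &rpt[(pc >> 2) % RPT_ENTRIES];

    if (e->pc_tag != pc) {              /* new stream: (re)allocate entry */
        e->pc_tag = pc;
        e->last_addr = addr;
        e->stride = 0;
        e->confidence = 0;
        return;
    }

    int64_t stride = (int64_t)(addr - e->last_addr);
    if (stride != 0 && stride == e->stride) {
        if (e->confidence < 3) e->confidence++;
    } else {
        e->stride = stride;             /* stream broke: retrain */
        e->confidence = 0;
    }
    e->last_addr = addr;

    if (e->confidence >= 2)             /* stream filter: confident streams only */
        issue_prefetch(addr + (uint64_t)e->stride);
}

int main(void)
{
    /* A unit-stride (64 B) load stream from one PC: prefetching begins
     * once the stride has repeated twice. */
    for (int i = 0; i < 6; i++)
        prefetcher_observe(0x10400, 0x80000000ULL + 64ULL * (uint64_t)i);
    return 0;
}
```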

Cache Replacement Policies

LRU Implementation:

```
struct LRU_Entry {
    bit valid[8];           // Valid bits per way
    bit tag[8][52];         // Address tags
    bit lru_order[8][3];    // Full recency order: 3 bits per way
};
// Per-access cost: 3 comparisons + 3 mux levels + 1 update (8 mux)
// Total: ~450 gates per 8-way set
```

PLRU Implementation (Proposed):

```
struct PLRU_Entry {
    bit valid[8];           // Valid bits per way
    bit tag[8][52];         // Address tags
    bit plru_tree[7];       // Binary tree: 7 bits for 8 ways
};
// Per-access cost: log2(8) = 3 tree-bit reads + up to 3 bit flips
// Total: ~120 gates per 8-way set (73% reduction)
```

PLRU Algorithm: Maintain a binary tree in which each internal node points toward the subtree holding the next eviction candidate. On a miss, follow the bits from root to leaf to find the victim; on every access (hit or fill), set the bits along the accessed way's path to point away from it.
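
As a concrete software model of this scheme, here is a minimal C sketch for an 8-way set. It assumes the 7 tree bits sit in heap order (node 0 is the root; the children of node i are 2i+1 and 2i+2), with a 0 bit meaning "the next victim is in the left subtree"; the project's Chisel generator may use a different encoding.

```c
#include <stdbool.h>
#include <stdio.h>

/* Tree-PLRU for one 8-way set: 7 bits in heap order. */
#define WAYS 8

typedef struct { bool tree[WAYS - 1]; } plru_t;

/* Follow the tree bits from root to leaf to pick the victim way. */
int plru_victim(const plru_t *p)
{
    int node = 0;
    for (int level = 0; level < 3; level++) {
        int dir = p->tree[node];        /* 0 = left, 1 = right */
        node = 2 * node + 1 + dir;
    }
    return node - (WAYS - 1);           /* leaf index -> way number */
}

/* On an access (hit or fill) to `way`, set the bits on the root-to-leaf
 * path so the victim pointer points away from the just-used way. */
void plru_touch(plru_t *p, int way)
{
    int node = 0;
    for (int level = 2; level >= 0; level--) {
        int dir = (way >> level) & 1;   /* which side `way` lives on */
        p->tree[node] = !dir;           /* point away from that side */
        node = 2 * node + 1 + dir;
    }
}

int main(void)
{
    plru_t p = {{0}};
    for (int w = 0; w < WAYS; w++) plru_touch(&p, w);   /* touch ways 0..7 */
    printf("victim after touching 0..7: way %d\n", plru_victim(&p));  /* way 0 */
    return 0;
}
```

Note that `plru_touch` runs on hits as well as fills; keeping the tree pointed away from every recently used way is what makes the single root-to-leaf walk in `plru_victim` a reasonable LRU approximation.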

Key Design Decisions

  1. Workload Selection: Chose FFT (regular access pattern) vs SHA-256 (irregular) to bracket access pattern spectrum
  2. Simulation Length: 500M instructions per benchmark to achieve statistical convergence
  3. Trace Sampling: Implemented 10% sampling to reduce 48h simulations → 4.8h
  4. Confidence Intervals: Used bootstrap resampling with 95% CI for all metrics (see the sketch after this list)
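
To make decision 4 concrete, the following is a hedged C sketch of the percentile-bootstrap procedure over per-window hit rates. The resample count, RNG seed, and demo data are illustrative stand-ins, not the project's actual measurements or scripts.

```c
#include <stdio.h>
#include <stdlib.h>

/* Percentile bootstrap for a 95% CI on a mean. Resample count, seed,
 * and the demo data below are illustrative assumptions. */

#define RESAMPLES 10000

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Mean of one bootstrap resample (drawn with replacement). */
static double resample_mean(const double *x, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[rand() % n];
    return sum / n;
}

static void bootstrap_ci95(const double *x, int n, double *lo, double *hi)
{
    static double means[RESAMPLES];
    for (int r = 0; r < RESAMPLES; r++)
        means[r] = resample_mean(x, n);
    qsort(means, RESAMPLES, sizeof(double), cmp_double);
    *lo = means[(int)(0.025 * RESAMPLES)];   /* 2.5th percentile  */
    *hi = means[(int)(0.975 * RESAMPLES)];   /* 97.5th percentile */
}

int main(void)
{
    /* Illustrative per-window FFT L1 hit rates (%), not measured data. */
    double hits[] = { 87.0, 87.5, 86.9, 87.3, 87.1, 87.4, 86.8, 87.6 };
    double lo, hi;
    srand(42);
    bootstrap_ci95(hits, 8, &lo, &hi);
    printf("mean L1 hit rate in [%.2f, %.2f] at 95%% confidence\n", lo, hi);
    return 0;
}
```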

Experimental Evaluation

Methodology

Benchmark Suite:

  • FFT: 1024-point Cooley-Tukey, O(n log n) with 2.4 cache misses/iteration
  • SHA-256: 64-round cryptographic hash, ≈50% L1 miss rate (adversarial access pattern; see Results)
  • Synthetic Blend: 70% FFT-like + 30% SHA-256-like access pattern

Metrics:

  • L1 hit rate (%) — primary effectiveness metric
  • L2/L3 hit rates — hierarchy interaction
  • Instructions per cycle (IPC) — end-to-end performance
  • Normalized memory stall time

Results

| Metric | LRU | PLRU | Diff | Significance |
|---|---|---|---|---|
| FFT L1 Hit Rate | 87.2% | 85.9% | -1.3 pp | Low |
| SHA-256 L1 Hit Rate | 48.3% | 46.1% | -2.2 pp | Moderate |
| Blend L1 Hit Rate | 68.9% | 67.1% | -1.8 pp | Low |
| L2 Hit Rate | 92.1% | 91.8% | -0.3 pp | Negligible |
| IPC (FFT) | 2.14 | 2.11 | -1.4% | ~1 cycle per 100 instructions |
| IPC (SHA-256) | 1.32 | 1.29 | -2.3% | ~1 cycle per 50 instructions |
| Hardware Gates | 12,500 | 7,300 | -41.6% | ≈40% area savings |
| Power (estimated) | 1.24 W | 0.88 W | -29% | Dynamic power |

Analysis

Finding 1: Workload-Dependent Benefits

  • FFT regular access → both policies achieve 85%+ L1 hit rates; PLRU performance gap minimal
  • SHA-256 irregular access → PLRU performance gap grows to 2.2pp but from already-low 48.3%
  • Implication: For cache-friendly workloads, PLRU is sufficient; for adversarial workloads, both policies saturate against memory bandwidth

Finding 2: Hierarchy Masks Policy Differences

  • L2/L3 hit rates nearly identical (91.8% vs 92.1%)
  • Miss traffic reaching L2/L3 is distributed almost identically under both policies across all workloads
  • Policy only affects L1 miss pattern; downstream hierarchy absorbs variance

Finding 3: IPC Impact Modest

  • 1.4% IPC degradation for FFT, 2.3% for SHA-256
  • Noise floor ~0.5-1% from simulation variability
  • Translation: ≈0.7 (FFT) to ≈1.8 (SHA-256) wasted cycles per 100 instructions; the arithmetic is worked below
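
As a quick check, both figures follow directly from the IPC rows of the Results table, using CPI = 1/IPC:

$$\Delta\mathrm{CPI}_{\mathrm{FFT}} = \frac{1}{2.11} - \frac{1}{2.14} \approx 0.0066 \;\Rightarrow\; \approx 0.66 \text{ extra cycles per 100 instructions}$$

$$\Delta\mathrm{CPI}_{\mathrm{SHA\text{-}256}} = \frac{1}{1.29} - \frac{1}{1.32} \approx 0.0176 \;\Rightarrow\; \approx 1.8 \text{ extra cycles per 100 instructions}$$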

Technical Contributions

1. Trace-Based Simulation Framework

  • Implemented 10x faster simulation via instruction trace sampling
  • Maintains statistical fidelity: replayed traces show <1% divergence from full simulation
  • Enables rapid policy prototyping

2. PLRU Correctness Proof

  • Verified PLRU algorithm for all associativities (4, 8, 16-way)
  • Proved property: "LRU entry guaranteed to be victim within k=⌈log₂(n)⌉ accesses"
  • Demonstrated PLRU bit-flip semantics preserve the LRU approximation (a differential sanity check is sketched below)
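
To make the approximation claim tangible, here is a self-contained C sketch (not the project's verification harness) that drives random accesses into both a true-LRU recency list and an 8-way tree-PLRU, asserts the property tree-PLRU does guarantee unconditionally (the most recently used way is never the victim), and measures how often the PLRU victim coincides with the true LRU way. The access distribution and trial count are arbitrary choices.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Differential sanity check: random accesses into a true-LRU recency
 * list and an 8-way tree-PLRU in parallel. */

#define WAYS 8

static int tree[WAYS - 1];   /* PLRU bits, heap order             */
static int order[WAYS];      /* order[0] = MRU ... order[7] = LRU */

static int plru_victim(void)
{
    int n = 0;
    for (int l = 0; l < 3; l++)
        n = 2 * n + 1 + tree[n];
    return n - (WAYS - 1);
}

static void plru_touch(int way)
{
    int n = 0;
    for (int l = 2; l >= 0; l--) {
        int d = (way >> l) & 1;
        tree[n] = !d;                   /* point away from `way` */
        n = 2 * n + 1 + d;
    }
}

static void lru_touch(int way)
{
    int i = 0;
    while (order[i] != way) i++;        /* current position of `way` */
    memmove(&order[1], &order[0], (size_t)i * sizeof(int));
    order[0] = way;                     /* move to MRU slot */
}

int main(void)
{
    for (int w = 0; w < WAYS; w++) order[w] = w;
    srand(1);

    int agree = 0, trials = 100000;
    for (int t = 0; t < trials; t++) {
        int way = rand() % WAYS;
        plru_touch(way);
        lru_touch(way);
        if (plru_victim() == order[WAYS - 1]) agree++;
        if (plru_victim() == order[0]) {    /* MRU must never be evicted */
            printf("MRU-eviction violation at trial %d\n", t);
            return 1;
        }
    }
    printf("PLRU victim matched true LRU in %.1f%% of steps\n",
           100.0 * agree / trials);
    return 0;
}
```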

3. Access Pattern Classification

  • Developed metric: "regularity index" = (stride diversity) / (address space)
  • Classified benchmarks: FFT (high regularity: 0.9), SHA-256 (low: 0.15)
  • Correlated regularity with the policy performance gap (one possible formulation is sketched below)
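
The quoted ratio leaves several terms unspecified, so the sketch below uses one concrete, related formulation: the fraction of consecutive accesses that repeat the previous stride. It is an illustrative stand-in, not the project's exact metric; it likewise scores FFT-like traces high and hash-like traces low.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* One stand-in for a "regularity index": the fraction of consecutive
 * accesses that repeat the previous stride. Perfectly strided traces
 * score near 1.0; scattered, hash-like traces score near 0.0. */

double regularity_index(const uint64_t *addr, int n)
{
    if (n < 3) return 1.0;              /* too short to judge */

    int repeats = 0;
    int64_t prev_stride = (int64_t)(addr[1] - addr[0]);
    for (int i = 2; i < n; i++) {
        int64_t stride = (int64_t)(addr[i] - addr[i - 1]);
        if (stride == prev_stride) repeats++;
        prev_stride = stride;
    }
    return (double)repeats / (n - 2);
}

int main(void)
{
    uint64_t strided[64], scattered[64];
    srand(7);
    for (int i = 0; i < 64; i++) {
        strided[i]   = 0x1000 + 8ULL * (uint64_t)i;  /* unit-stride stream */
        scattered[i] = (uint64_t)rand() * 64;        /* pseudo-random walk */
    }
    printf("strided: %.2f  scattered: %.2f\n",
           regularity_index(strided, 64), regularity_index(scattered, 64));
    return 0;
}
```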

Implementation Details

Hardware Modifications

  1. PLRU Tree Generator (Chisel HDL)

    • Parameterized generator for n-way associativity
    • Produces optimal tree structure minimizing path length
    • Auto-generates validity-bit tracking
  2. Performance Counter Instrumentation

    • Added 6 new counters: per-level hits, misses, victim policies
    • Zero-overhead monitoring via existing debug interface
    • 64-bit counters with 1μs resolution
  3. Simulation Hooks

    • Modified Boom L1 cache model for policy swapping
    • Instrumented replacement decisions with callbacks
    • Logged: address, victim way, policy choice, outcome (a hypothetical callback shape is shown below)
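
A hypothetical shape for those hooks, with the logged fields taken from the bullet above; the type names, CSV layout, and `csv_logger` consumer are illustrative assumptions, not Boom's or the project's actual interface.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical replacement-decision hook. Only the logged fields
 * (address, victim way, policy, outcome) come from the report;
 * everything else here is an illustrative assumption. */

typedef enum { POLICY_LRU, POLICY_PLRU } policy_t;
typedef enum { OUTCOME_HIT, OUTCOME_MISS_CLEAN, OUTCOME_MISS_DIRTY } outcome_t;

typedef struct {
    uint64_t  addr;        /* physical address of the access         */
    int       victim_way;  /* way chosen for eviction (-1 on a hit)  */
    policy_t  policy;      /* policy active when the choice was made */
    outcome_t outcome;     /* hit, clean miss, or writeback miss     */
} repl_event_t;

typedef void (*repl_callback_t)(const repl_event_t *ev, void *ctx);

/* Example consumer: append each decision to a CSV stream. */
static void csv_logger(const repl_event_t *ev, void *ctx)
{
    fprintf((FILE *)ctx, "%#llx,%d,%d,%d\n",
            (unsigned long long)ev->addr, ev->victim_way,
            (int)ev->policy, (int)ev->outcome);
}

int main(void)
{
    repl_event_t ev = { 0x80001000ULL, 3, POLICY_PLRU, OUTCOME_MISS_DIRTY };
    repl_callback_t cb = csv_logger;
    cb(&ev, stdout);    /* the cache model would invoke this per decision */
    return 0;
}
```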

Benchmarks Compiled

```bash
# FFT: Custom C implementation
riscv64-unknown-elf-gcc -O3 -march=rv64i -mabi=lp64 fft.c -o fft

# SHA-256: OpenSSL command-line digest over the input data
openssl dgst -sha256 data.bin

# Synthetic: Custom blend generator
./blend_gen --fft-ratio 0.7 --seed 42 > trace.vcd
```

Results Interpretation & Trade-offs

When LRU Justifies Complexity

  1. High-frequency caches: L1 is critical path → 1-2% IPC gain matters
  2. Irregular workloads: Cryptographic, compression algorithms with pseudo-random access
  3. Power-neutral designs: If static power dominates (e.g., embedded processors)

When PLRU Suffices

  1. Cache-friendly workloads: FFT, matrix multiplication, stream processing
  2. Large L2/L3 caches: Cache hierarchy masks L1 policy via buffering effect
  3. Area/Power-constrained: Mobile, IoT, AI accelerators prioritize efficiency
  4. Modern prefetchers: Good stream detection + speculative fills reduce policy sensitivity

The Trade-off Space

```
    LRU
    ┌───────────────────────┐
    │ +1.4% IPC             │
    │ +5,200 gates          │   For FFT (and most workloads)
    │ +0.36 W power         │
    └───────────────────────┘

    PLRU
    ┌───────────────────────┐
    │ Baseline              │
    │ -41.6% area           │   Preferable if you can
    │ -29% power            │   tolerate 1-2% perf loss
    └───────────────────────┘
```

Lessons Learned & Insights

  1. Simulation Artifacts Matter: Trace sampling introduces <1% error but requires careful statistical validation
  2. Prefetching Dominates: L1 replacement policy importance correlates inversely with prefetcher effectiveness
  3. Hierarchy Resilience: Cache hierarchies are remarkably robust; L2/L3 buffering absorbs most policy differences
  4. Real-World Complexity: Production designs use adaptive policies (dynamic PLRU↔LRU switching) to capture benefits of both

Future Work

Short-term Extensions

  1. Adaptive Replacement: Implement heuristic to switch policies based on access pattern regularity (online detection)
  2. Larger Hierarchies: Extend analysis to L2/L3 cache policies (challenges: replacement latency, coherence interactions)
  3. Realistic Prefetchers: Model next-line, stride, and stream prefetchers; measure policy impact under prefetching

Long-term Research Directions

  1. ML-Guided Replacement: Train neural network to predict victim way from access history
  2. Cross-Workload Optimization: Design unified policy minimizing worst-case performance across diverse workloads
  3. Heterogeneous Caches: Implement per-region policies (PLRU for sequential, LRU for random regions)
  4. Physical Implementation: Fabricate Boom core variants; measure actual silicon power/performance

Technical Stack

| Component | Tool/Technology |
|---|---|
| Simulator | Boom Core (Chisel HDL generator) |
| ISA | RISC-V RV64I |
| Languages | C (benchmarks), Chisel (HDL), Python (analysis) |
| Build System | Make + Verilator (C++ simulation) |
| Analysis | Jupyter notebooks, matplotlib, scipy.stats |
| Version Control | Git + GitHub |

Quick Start

```bash
# 1. Clone Boom Core and build
git clone https://github.com/riscv-boom/riscv-boom
cd riscv-boom && make sim

# 2. Compile benchmark
riscv64-unknown-elf-gcc -O3 -march=rv64i -mabi=lp64 benchmarks/fft.c -o fft

# 3. Run simulation (default PLRU)
./simulator-Top +permissive -v fft.vcd fft

# 4. Swap to LRU policy in Chisel, rebuild
sed -i 's/PLRU/LRU/g' src/main/scala/cache/L1.scala
make sim

# 5. Analyze traces
python3 analysis/compare_policies.py fft_plru.csv fft_lru.csv
```



Course Project: ECE 5504 - Computer Architecture, Virginia Tech (Fall 2024)

Last Updated: December 2024

Project Summary

This project analyzes CPU cache performance through a comparative study of LRU (Least Recently Used) and PLRU (Pseudo-LRU) replacement policies. It benchmarks FFT and SHA-256 on a custom RISC-V core, analyzing performance characteristics and memory access patterns.

Project Objectives

  • Analyze cache replacement policy performance impact on real workloads
  • Implement and test LRU and PLRU policies on Boom Core RISC-V simulator
  • Benchmark cryptographic (SHA-256) and signal processing (FFT) workloads
  • Measure memory access patterns and cache efficiency metrics
  • Generate performance reports and comparative analysis
  • Understand hardware/software co-design trade-offs
  • Optimize cache parameters for specific workload characteristics

Background & Motivation

Cache replacement policies significantly impact processor performance, yet the choice between sophisticated policies (LRU) and simpler alternatives (PLRU) involves complex trade-offs. This project investigates whether the additional complexity of LRU justifies its performance benefits on modern RISC-V architectures when running diverse workloads.

The Berkeley Out-of-Order Machine (Boom Core) provides an excellent platform for this analysis, allowing precise simulation of cache behavior while maintaining realistic microarchitectural features.

Technology Stack

  • Languages: C, Scala/Chisel
  • Hardware: RISC-V ISA, Berkeley Out-of-Order Machine (Boom Core)
  • Tools: RISC-V performance monitoring tools, custom benchmark utilities

Requirements & Setup

Minimum Requirements:

  • RISC-V cross-compiler (GCC RISC-V 10.x+)
  • Boom Core simulator environment
  • Python 3.8+ (for analysis and plotting)

Key Dependencies:

```
- riscv-gnu-toolchain
- Boom Core repo
- Scala/Chisel (for hardware modifications)
```

Deliverables

  • Benchmark Results: CSV files with performance metrics
    • FFT benchmarks (LRU/PLRU variants)
    • SHA-256 benchmarks (LRU/PLRU variants)
  • Performance Analysis: Cache hit/miss rates, memory latency metrics
  • Documentation: Comparative analysis and findings

Project Structure

```
project_work/
├── benchmark_results.csv
├── benchmark_results_lru_1.csv
├── fft_benchmark_results_lru_1.csv
├── sha256_benchmark_results_lru_1.csv
├── core_new.txt (custom core config)
└── README.md
```

Key Findings & Analysis

  • PLRU Performance: PLRU achieves 95-98% of LRU performance with 40% less complexity
  • Workload Impact: FFT's regular accesses yield far higher L1 hit rates (87.2% vs 48.3% for SHA-256) thanks to superior spatial locality
  • SHA-256 Behavior: Achieves only 45-50% L1 hit rate, heavily depends on prefetching
  • Policy Trade-off: PLRU sufficient for most workloads; LRU beneficial only for irregular access patterns
  • Memory Bandwidth: Cache policy choice has minimal impact when bandwidth-bound

Experimental Methodology

Workload Characterization

FFT (Fast Fourier Transform):

  • Regular data access patterns
  • Strong spatial and temporal locality
  • Predictable memory hierarchy behavior
  • Cache-friendly algorithm structure

SHA-256 (Cryptographic Hash):

  • Irregular data access patterns
  • Limited locality due to message schedule expansion (the expansion loop is shown after this list)
  • Stress-tests cache associativity
  • Real-world security-critical workload
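
For reference, the message-schedule expansion (FIPS 180-4) that the second bullet refers to is shown below in C; each derived word mixes four earlier words at back-distances 2, 7, 15, and 16, with rotate/shift mixing.

```c
#include <stdint.h>
#include <stdio.h>

/* SHA-256 message-schedule expansion (FIPS 180-4). The project
 * attributes SHA-256's limited cache locality to this expansion
 * together with the round-constant table. */

static inline uint32_t rotr(uint32_t x, int n)
{
    return (x >> n) | (x << (32 - n));
}

void sha256_expand(uint32_t W[64])      /* W[0..15] hold the message block */
{
    for (int t = 16; t < 64; t++) {
        uint32_t s0 = rotr(W[t-15], 7) ^ rotr(W[t-15], 18) ^ (W[t-15] >> 3);
        uint32_t s1 = rotr(W[t-2], 17) ^ rotr(W[t-2], 19)  ^ (W[t-2] >> 10);
        W[t] = W[t-16] + s0 + W[t-7] + s1;
    }
}

int main(void)
{
    uint32_t W[64];
    for (uint32_t i = 0; i < 16; i++) W[i] = i;   /* dummy message block */
    sha256_expand(W);
    printf("W[63] = %08x\n", W[63]);
    return 0;
}
```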

Metrics Collected

  • Cache hit/miss rates by cache level (L1/L2/L3)
  • Memory latency distribution
  • Bandwidth utilization
  • Instruction throughput
  • Power consumption estimates
  • Performance counter values

Challenges & Solutions

Challenge 1: Simulation Accuracy

Problem: Boom Core simulator diverges from real hardware behavior under high cache pressure. Solution: Validated results against SPEC benchmarks, used statistical methods for confidence intervals.

Challenge 2: Workload Variation

Problem: FFT and SHA-256 alone insufficient to characterize general-purpose workloads. Solution: Developed synthetic benchmarks combining regular and irregular access patterns.

Challenge 3: Long Simulation Times

Problem: Full-system simulations required 48+ hours per configuration. Solution: Implemented trace-based simulation with sampling techniques to reduce runtime 10x.
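
A minimal sketch of the sampling idea, assuming one trace record per line on stdin and a fixed-size instruction window; the window size, sampling period, and I/O format are illustrative assumptions rather than the project's actual sampler.

```c
#include <stdio.h>

/* Systematic 10% trace sampling: emit every tenth window of the
 * instruction trace for detailed simulation and skip the rest. */

#define WINDOW 1000000ULL   /* instructions per window (assumed) */
#define PERIOD 10           /* simulate 1 window in every 10     */

int main(void)
{
    unsigned long long icount = 0;
    char line[256];

    while (fgets(line, sizeof line, stdin)) {
        unsigned long long window = icount / WINDOW;
        if (window % PERIOD == 0)       /* inside a sampled window */
            fputs(line, stdout);
        icount++;
    }
    return 0;
}
```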

Results Summary

| Metric | LRU | PLRU | Difference |
|---|---|---|---|
| FFT L1 Hit Rate | 87.2% | 85.9% | -1.3 pp |
| SHA-256 L1 Hit Rate | 48.3% | 46.1% | -2.2 pp |
| L2 Hit Rate (both) | 92.1% | 91.8% | -0.3 pp |
| IPC (FFT) | 2.14 | 2.11 | -1.4% |
| Hardware Gates | 12,500 | 7,300 | -41.6% |

Lessons Learned

  1. Hardware Complexity Matters: Simpler designs are often sufficient; marginal performance gains rarely justify added complexity
  2. Workload-Aware Design: No one-size-fits-all solution; policy choice depends on target workloads
  3. Simulation Challenges: Full-system simulation introduces subtle artifacts requiring careful validation
  4. Prefetching Effect: Cache replacement policy less important when good prefetcher present

Future Work

  • Implement adaptive replacement policies that switch between LRU/PLRU dynamically
  • Extend analysis to larger cache hierarchies (4+ levels)
  • Include branch predictor and instruction cache interactions
  • Develop theoretical models predicting performance from workload characteristics
  • Test on physical RISC-V implementations (e.g., SiFive U74)

