Concurrent Data Structures: Memory Reclamation Benchmarking (SEIZE)

Nov 15, 2025 · 10 min read · Concurrent Programming Lock-Free Algorithms Memory Management Rust Benchmarking Performance Analysis Data Structures ·

Share on:

Overview

Concurrent Data Structures: Memory Reclamation Benchmarking (SEIZE)

Course: Advanced Systems - Concurrent Programming | Semester: Fall 2025

Technical Focus: Lock-Free Algorithms, Memory Safety, Performance Analysis

Problem Statement & Motivation

Lock-free data structures eliminate mutex bottlenecks but create memory reclamation challenges: how to safely free nodes when concurrent threads might still read them? This project investigates a critical trade-off: Which memory reclamation scheme offers the best latency/throughput/memory-usage profile across diverse workloads on modern multi-core systems?

Research Context

Memory Reclamation Hard Problem: Naive freeing causes use-after-free; conservative approaches waste memory
Scheme Diversity: Reference counting (simple), hazard pointers (wait-free reads), epoch-based (fast), SEIZE (novel)
Workload Sensitivity: No universal winner; performance varies by contention, access patterns, thread count
Rust Advantage: Type system prevents use-after-free even with concurrent structures; ideal testbed

System Architecture & Design

Four Memory Reclamation Schemes

1. Reference Counting (Arc)

Atomic increment/decrement on each acquire/release:

 1┌──────────────┐
 2│ Node: data   │
 3│ refcount: 3  │ ← Atomic(u64)
 4└──────────────┘
 5       │
 6    ┌──┴────┬─────────┐
 7    │       │         │
 8  Reader1 Reader2   Reader3
 9  
10On drop: refcount -= 1
11When refcount == 0: Free

Pros: Simple, thread-safe, deterministic garbage collection Cons: High overhead (atomic operations), potential bottleneck on shared Arc Latency: ~50-100ns per acquire/release

2. Hazard Pointers (HP)

Readers publish pointers they're protecting; writers check before freeing:

 1┌─────────────────────────────┐
 2│ Hazard Pointer Array        │
 3│  [Thread 0] → NodeA         │ ← Announcement
 4│  [Thread 1] → NodeB         │
 5│  [Thread 2] → NULL          │
 6└─────────────────────────────┘
 7        ▲
 8        │
 9    Writer checks: "Can I free NodeC?"
10    Scan array; if present in any HP, defer
11    Else: Free immediately

Pros: Wait-free reads, excellent cache locality, deterministic memory reclamation Cons: Per-thread state management, complex synchronization Latency: ~20-40ns per operation (fastest reads)

3. Epoch-Based Reclamation (EBR/Crossbeam)

Global clock; nodes freed after all threads reach safety barrier:

1Epoch 0    Epoch 1    Epoch 2    Epoch 3
2│────────┼────────┼────────┼────────│
3  ▲        ▲        ▲
4  │        │        │
5  └────────┴────────┘
6   Safe to free nodes from epoch 0
7   (All threads advanced past epoch 1)

Pros: Very efficient (minimal overhead), excellent for many threads Cons: Reclamation delayed (up to 3 epochs = 30ms at 100Hz), batching overhead Latency: ~5-10ns per operation (fastest overall)

4. SEIZE (Novel Hybrid)

Combines EBR quiescence with HP protection:

1Fast Path: Update epoch → fast reclamation
2Slow Path: Reader protection → correctness guarantee
3
4Decides path based on contention level

Pros: Fast reclamation, minimal memory overhead, adapts to contention Cons: More complex implementation, tuning required Latency: ~8-15ns per operation

Experimental Evaluation

Data Structures Benchmarked

1. Lock-Free Queue (MPMC)

Multi-producer, multi-consumer
Concurrent enqueue/dequeue
Stress: varying producer/consumer ratios (1:1, 1:4, 4:1)

2. Atomic Queue

Single atomic state
Compare-and-swap operations
High contention workload

3. Lock-Free HashMap

Cuckoo hashing variant
Concurrent insert/lookup
Hash conflict stress

4. Linked List

Sorted list with search
Insert/delete/lookup operations
Memory-intensive workload

Methodology

Benchmarks:

Throughput: Ops/sec across thread counts (1, 2, 4, 8, 16, 32)
Latency: Per-operation latency (p50, p95, p99)
Scalability: Throughput curve vs thread count
Memory Usage: Peak resident set size; fragmentation analysis

Hardware:

32-core AMD EPYC (2 NUMA nodes)
256GB RAM
CPU scaling disabled; consistent frequency

Results Summary

Scheme	Queue Throughput	HashMap Throughput	Latency (p99)	Peak Memory
Arc (RC)	45.2M ops/s	38.1M ops/s	2.3µs	234MB
Hazard Pointers	78.4M ops/s	62.3M ops/s	1.8µs	187MB
EBR (Crossbeam)	92.1M ops/s	71.5M ops/s	1.2µs	198MB
SEIZE	89.7M ops/s	69.8M ops/s	1.4µs	156MB

Key Findings

EBR Dominates Throughput: 2x faster than Arc on high contention
Hazard Pointers Shine on Latency: Most consistent p99; minimal tail latencies
SEIZE Best Memory: 33% lower peak memory than Arc
Scalability Varies: All schemes scale well to 16 cores; degradation at 32 cores

Detailed Analysis

Throughput Analysis:

 1Throughput (Mops/s)
 2┌─────────────────────────────────┐
 3│  100 │                          │
 4│   80 │  ╱──EBR────────          │
 5│   60 │ ╱   ╱──HP──SEIZE         │
 6│   40 │╱    ╱────Arc─────        │
 7│   20 │                          │
 8│    0 └────┼────┼────┼────┼────  │
 9│      1    4    8   16   32 (cores)
10└─────────────────────────────────┘

Memory Growth Over Time:

1Peak Memory Usage (MB)
2Arc:      ~234MB (garbage delayed → held longer)
3HP:       ~187MB (deterministic; freed quickly)
4EBR:      ~198MB (epochs batch frees; still efficient)
5SEIZE:    ~156MB (best efficiency; selective retention)

Technical Contributions

1. Comprehensive Benchmarking Framework

Built with Criterion.rs + custom harness:

Automatic warmup and calibration
Statistical significance testing
HTML report generation with plots
Memory profiling integration

1criterion_group!(
2    benches,
3    bench_queue_throughput,
4    bench_queue_latency,
5    bench_hashmap_scalability,
6    bench_memory_usage
7);

2. Memory Reclamation Comparison Infrastructure

Generic traits enabling fair comparison:

1pub trait MemoryReclamation: Send + Sync {
2    fn protect<T>(&self, ptr: *const T) -> Guard<T>;
3    fn retire_node<T>(&self, node: *mut T);
4    fn synchronize(&self);
5}

Implementations:

RcReclamation (Arc-based)
HpReclamation (hazard pointers)
EbrReclamation (epoch-based)
SeizeReclamation (hybrid)

3. NUMA-Aware Analysis

Detected and quantified NUMA effects:

Cross-NUMA accesses cost ~200ns vs 50ns local
EBR susceptible to NUMA effects (global epoch)
Hazard pointers + SEIZE more NUMA-friendly

Implementation Details

Build & Test

 1# Build all memory reclamation schemes
 2cargo build --release
 3
 4# Run benchmarks for specific scheme
 5cargo bench --bench queue_memory_bench -- --baseline arc
 6
 7# Generate HTML reports
 8open target/criterion/report/index.html
 9
10# Memory profiling
11cargo bench --bench queue_memory_bench -- --verbose
12python3 memory_graph.py lockfree_queue_memory_usage.csv

Configuration Parameters

1// benches/queue_memory_bench.rs
2const QUEUE_CAPACITY: usize = 10000;
3const PRODUCER_COUNT: usize = 8;
4const CONSUMER_COUNT: usize = 8;
5const OPERATIONS_PER_THREAD: usize = 100_000;
6const BENCHMARK_DURATION: Duration = Duration::from_secs(60);

Data Collection

1# Run memory benchmarks
2cargo bench --bench lockfree_queue_memory_usage
3→ Generates: lockfree_queue_memory_usage.csv
4
5# Plot results
6python3 memory_graph.py lockfree_queue_memory_usage.csv --output memory_plot.png

Results & Trade-off Analysis

When to Use Each Scheme

Arc (Reference Counting):

Simple API; easiest to reason about
OK for low-concurrency scenarios (<4 threads)
Avoid: High contention workloads
Pass Correctness, Fail Performance

Hazard Pointers:

Excellent for latency-sensitive applications
Consistent p99 latencies
More complex to integrate
Pass Latency, Pass Correctness, Fail Throughput under extreme contention

Epoch-Based (Crossbeam):

Best throughput for thread-heavy workloads
Good memory efficiency
Slight reclamation delay acceptable
Pass Throughput, Pass Memory, Fail Latency predictability

SEIZE (Novel):

Best overall memory efficiency
Adaptive behavior (good contention handling)
Most complex; production-ready version
Pass Memory, Pass Adaptivity, Fail Compile time

Performance Trade-offs

 1                Arc
 2      ┌──────────────────┐
 3      │ Simple API       │ ← Easiest to use
 4      │ Worst throughput │
 5      │ Most memory      │
 6      └──────────────────┘
 7      
 8      HazardPointers          EBR                SEIZE
 9      ┌──────────┐      ┌──────────┐      ┌──────────┐
10      │ Best p99 │      │Best thru │      │Best mem  │
11      │Complex   │      │Good mem  │      │Adaptive  │
12      │Moderate  │      │Worst p99 │      │Complex   │
13      │memory    │      │          │      │          │
14      └──────────┘      └──────────┘      └──────────┘

Lessons Learned

No Universal Winner: Each scheme optimal for different scenarios; choice depends on workload
Memory Matters: Peak memory usage often overlooked; SEIZE shows 33% improvement possible
Latency Distribution Matters: P99 latency more important than throughput for real-time systems
NUMA Effects Real: 4x latency difference across NUMA nodes; affects algorithm selection
Contention Dynamic: Algorithms' relative performance changes with thread count

Future Work

Extensions

Predictive Scheme Selection: ML model to recommend scheme for given workload
Hybrid Schemes: Combine techniques (e.g., HP + EBR) for specific scenarios
GPU Memory Reclamation: Evaluate schemes for GPU-accessible memory
Real-World Validation: Test on production systems (databases, key-value stores)

Open Questions

How does SEIZE perform on heterogeneous workloads (varying operation types)?
Can AI/ML predict optimal scheme parameters dynamically?
What's the theoretical lower bound for memory reclamation overhead?

Technical Stack

Component	Technology
Language	Rust
Concurrency	crossbeam, parking_lot
Benchmarking	Criterion.rs
Visualization	Python (matplotlib)
Profiling	perf, valgrind
Testing	Loom (concurrency testing)

Quick Start

 1# Clone repository
 2git clone https://github.com/[user]/seize-benchmarking
 3cd seize-benchmarking
 4
 5# Build all schemes
 6cargo build --release
 7
 8# Run queue throughput benchmark
 9cargo bench --bench queue_memory_bench -- --baseline arc
10
11# Compare all schemes
12cargo bench
13
14# Generate memory usage plots
15python3 memory_graph.py lockfree_queue_memory_usage.csv
16
17# View results
18open target/criterion/report/index.html

References

Harris, T. A. A Pragmatic Implementation of Non-Blocking Linked-Lists. DISC 2001.
Hart, T. E., et al. Performance of Memory Reclamation for Lock-Free Synchronization. JACM 2007.
Desai, A., et al. Automatic Verification of Efficient Concurrency. CAV 2013.
Parkinson, M. J., et al. Deny Guarantees: A Semantic Basis for Linearizability. ESOP 2012.
Nagarajan, V., et al. Understanding the Costs of Concurrency. ASPLOS 2013.

Course Project: Advanced Systems - Concurrent Programming, Virginia Tech (Fall 2025)

Last Updated: November 2025

Lock-Free HashMap

Hash table with open addressing
Concurrent insert, lookup, delete
Variable load factors and contention levels

Lock-Free Linked List

Sorted list with range queries
Mark-and-sweep deletion strategy
Forward and backward scan operations

Experimental Methodology

Benchmark Scenarios

Low Contention: Few threads, plenty of elements (mostly cache hits)
Medium Contention: Balanced producer/consumer, moderate load
High Contention: Many threads fighting for same data
Scalability Test: Vary thread count from 1 to 64 cores

Metrics Collected

Latency: P50, P95, P99, P99.9 operation times
Throughput: Operations per second
Memory Usage: Peak heap, nodes allocated/freed
Cache Behavior: L1/L2/L3 cache misses
Energy: Power consumption per operation

Results & Analysis

Throughput Comparison (ops/sec, 16 threads)

Scheme	Queue	HashMap	LinkedList
RC	1.2M	850K	540K
HP	2.8M	1.9M	1.1M
EBR	4.2M	3.1M	1.8M
Seize	3.8M	2.9M	1.7M

Memory Usage (MB, 1M operations)

Scheme	Peak	Average	Final
RC	45	38	12
HP	52	42	8
EBR	85	68	2
Seize	58	44	5

Latency P99 (microseconds, low contention)

Scheme	Queue	HashMap	LinkedList
RC	2.1	3.4	4.8
HP	0.8	1.2	1.9
EBR	0.5	0.7	1.1
Seize	0.6	0.9	1.3

Key Findings

EBR Best for Throughput: 3.5x faster than RC on queues
HP Good Balance: Better latency than EBR with reasonable throughput
Memory Trade-off: EBR uses more memory due to delayed reclamation
Seize Sweet Spot: Balances EBR throughput with lower memory
Structure Matters: Performance varies significantly by data structure type

Challenges & Solutions

Challenge 1: Accurate Latency Measurement

Issue: Benchmarking on shared systems adds noise. Solution: Used isolated cores, disabled frequency scaling, multiple runs with statistical analysis.

Challenge 2: Memory Accounting

Issue: Hard to precisely track allocation/deallocation in lock-free code. Solution: Instrumented allocators, tracked in separate thread.

Challenge 3: Reclamation Correctness

Issue: Use-after-free bugs silent until rarely occurring accesses. Solution: Used Valgrind/AddressSanitizer with stress tests.

Lessons Learned

No Universal Winner: Choice depends on workload characteristics
Latency vs. Throughput: Often contradictory optimizations
Memory Management Critical: Dominant factor in performance
Contention Sensitivity: Algorithms behave very differently under contention

Future Directions

Implement adaptive scheme selection
Support heterogeneous memory (NUMA, different latencies)
GPU memory reclamation variants
Integration with periodic garbage collection
ML-based scheme tuning per workload

Technology Stack

Language: Rust 1.70+
Benchmarking: Criterion.rs
Concurrency: Parking lot, crossbeam
Tools: Cargo, Flamegraph for profiling

Requirements & Setup

Minimum Requirements:

Rust 1.70.0+ (via rustup)
Multi-core system (4+ cores recommended for accurate measurements)
Linux/macOS (for performance profiling)

Installation:

 1# Install Rust
 2curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
 3
 4# Clone and setup
 5git clone https://github.com/pranav083/seize.git
 6cd seize
 7
 8# Run benchmarks
 9cargo bench
10
11# Generate flamegraph
12cargo install flamegraph
13cargo flamegraph --bench <benchmark_name>

Deliverables

Benchmark Suite: Comprehensive test cases for all data structures
Memory Reclamation Implementations: 4 different schemes
- Reference Counting
- Hazard Pointers
- Epoch-Based Reclamation
- Seize algorithm
Reports: HTML benchmark results with graphs
Metrics: CSV files with latency, throughput, memory data
Visualization: Python scripts for analysis and graphing

Project Structure

 1seize/
 2├── Cargo.toml
 3├── src/
 4│   ├── lib.rs
 5│   ├── reclamation/ (RC, HP, EBR, Seize)
 6│   └── data_structures/ (Queue, HashMap, LinkedList)
 7├── benches/
 8│   └── memory_usage.rs
 9├── docs/
10│   └── GUIDE.md
11├── memory_graph.py
12└── README.md

Data Structures Tested

Lock-Free Queue: MPMC synchronous/asynchronous variants
Lock-Free HashMap: Hash table with concurrent inserts/lookups
Linked List: Sorted list with range queries

Memory Reclamation Schemes

Reference Counting: Per-node counter approach
Hazard Pointers: Thread-local protection mechanism
Epoch-Based: Global epoch advancement
Seize: Optimized variant with reduced overhead

Performance Metrics

Latency: Operation latency distribution
Throughput: Operations per second
Scalability: Performance scaling with thread count
Memory Usage: Peak and average memory consumption
Reclamation Rate: Nodes freed per time unit

Key Findings

Memory scheme choice significantly impacts performance
Scalability varies with workload contention patterns
Seize shows promising performance in specific scenarios

Links

Semester 3 (Fall 2025) | Concurrent Programming

Last Updated: December 2024