#### Systematic CXL Memory Characterization and Performance Analysis at Scale

Microsoft, Samsung<sup>†</sup>, Virginia Tech

Jinshu Liu, Hamid Hadian, Yuyue Wang, Daniel S. Berger\*, Marie Nguyen<sup>+</sup>, Xun Jian, Sam H. Noh, Huaicheng Li

jinshu@vt.edu

## Heterogeneous CXL Latency and Bandwidth

## Melody Overview



CXL introduces higher and more diverse memory latency. What are the performance implications of CXL memory across CXL devices, processors, and workloads at scale?

A comprehensive framework for CXL characterization and analysis

265 workloads across 4 CXL devices under 7 memory latency levels on 5 processors

1. Unstable and unpredictable latency introduced by CXL

µs-scale memory tail latency even when bandwidth is not saturated

2. Extensive CXL characterization across diverse workloads

Quantitative slowdowns due to latency or bandwidth boundness

3. SPA: A simple and accurate performance analysis approach

#### **Research Questions:**

Is CXL latency as stable/predictable as regular DRAM?

How does CXL latency affect workload performance?

How does CXL latency affect the CPU pipeline (e.g., prefetching)?

9 CPU counters for accurate slowdown estimation (<5% inaccuracy for over 95% of workloads)
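The accuracy claim (<5% inaccuracy for over 95% of workloads) can be read as a coverage check over a set of workloads. The sketch below is one possible reading, not the authors' evaluation code: it assumes "inaccuracy" means the absolute difference between estimated and measured slowdown, and the function name `fraction_within` and the sample values are hypothetical.

```python
def fraction_within(estimated, measured, tolerance=0.05):
    """Fraction of workloads whose estimated slowdown is within `tolerance`
    of the measured slowdown (assumption: absolute difference)."""
    if len(estimated) != len(measured) or not estimated:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(1 for e, m in zip(estimated, measured) if abs(e - m) <= tolerance)
    return hits / len(estimated)


# Made-up slowdowns for four workloads (illustration only, not paper data).
estimated = [0.10, 0.32, 0.05, 0.50]
measured = [0.12, 0.30, 0.15, 0.51]

# Three of the four estimates fall within 0.05 of the measurement.
coverage = fraction_within(estimated, measured)
```

Under this reading, the poster's claim corresponds to `coverage` exceeding 0.95 across the evaluated workloads.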

Dissect the root causes of CXL slowdown

Disclose CPU prefetching inefficiency

## CXL Tail Latency

### Workload Characterization on CXL



The performance gap between CXL(-D) and NUMA diminishes due to its …

CXL memory can be used as a viable alternative to NUMA memory

## SPA: <u>Stall-based CXL Performance Analysis</u>

#### 1. Workload slowdown breakdown

$\Delta\text{CPU Cycles} \approx \Delta\text{DRAM-Stalls} + \Delta\text{Cache-Stalls} + \Delta\text{Store-Stalls}$

$\text{Slowdown}\ (S) = \Delta\text{CPU Cycles} / \text{Cycles on Local}$

$\text{Slowdown}\ (S) = S_{DRAM} + S_{Cache} + S_{Store}$
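The breakdown can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: the `StallCounts` fields and the sample numbers are hypothetical, and the poster does not list which hardware counters feed each stall category.

```python
from dataclasses import dataclass


@dataclass
class StallCounts:
    """Cycle and stall-cycle totals for one run (hypothetical field names)."""
    cycles: float
    dram_stalls: float
    cache_stalls: float
    store_stalls: float


def slowdown_breakdown(local: StallCounts, cxl: StallCounts) -> dict:
    """Split the CXL slowdown into DRAM, cache, and store components.

    Each component is the increase in one stall category divided by the
    total cycles of the local-DRAM run.
    """
    base = local.cycles
    parts = {
        "S_DRAM": (cxl.dram_stalls - local.dram_stalls) / base,
        "S_Cache": (cxl.cache_stalls - local.cache_stalls) / base,
        "S_Store": (cxl.store_stalls - local.store_stalls) / base,
    }
    # Estimated slowdown: sum of the three stall-based components.
    parts["S_estimated"] = parts["S_DRAM"] + parts["S_Cache"] + parts["S_Store"]
    # Measured slowdown: relative increase in total CPU cycles.
    parts["S_measured"] = (cxl.cycles - local.cycles) / base
    return parts


# Made-up numbers for illustration only (not measurements from the paper).
local_run = StallCounts(cycles=1000.0, dram_stalls=200.0,
                        cache_stalls=100.0, store_stalls=50.0)
cxl_run = StallCounts(cycles=1300.0, dram_stalls=400.0,
                      cache_stalls=150.0, store_stalls=80.0)

result = slowdown_breakdown(local_run, cxl_run)
```

With these sample values the three components are 0.20, 0.05, and 0.03, so the estimated slowdown is about 0.28 against a measured 0.30. The gap reflects that the cycle decomposition is an approximation.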

S<sub>DRAM</sub>: Demand read miss on L3

S<sub>Cache</sub>: Less efficient prefetching under longer memory latency

S<sub>Store</sub>: Intensive RFO on Store Buffer with limited size

#### 2. CXL slowdown for real-world workloads

The sources of slowdown vary across workloads

#### 3. Cache slowdown reasoning





#### 4. Dynamic slowdown



More in the paper

Appears in the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'25)