# Systematic CXL Memory Characterization and Performance Analysis at Scale

Jinshu Liu, Hamid Hadian, Yuyue Wang, Daniel S. Berger<sup>\*</sup>, Marie Nguyen<sup>†</sup>,

Xun Jian, Sam H. Noh, Huaicheng Li





## Compute Express Link (CXL) Use Cases

#### Growing demand from memory-intensive applications


















### How is CXL Implemented?

#### PCIe electricals + low-latency protocol layers



- Transaction layer: queueing, processing, and ordering
- Link layer: transaction reliability, data integrity



















What is the performance implication of CXL memory across CXL devices, processors, and workloads at scale?


#### I. Measure average latency and bandwidth for a single CXL device

Overlooks performance variation

#### II. Quantify the performance of ~10 workloads

Limited scope of workloads

#### III. Observational approaches for performance analysis

Lack of root-cause analysis

[1] Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices [MICRO '23]

[2] Exploring Performance and Cost Optimization with ASIC-Based CXL Memory [EuroSys '24]

[3] A Mess of Memory System Benchmarking, Simulation and Application Profiling [MICRO '24]

## Melody Overview

#### A comprehensive framework for CXL characterization and analysis

265 workloads across 4 CXL devices under 7 memory latency configurations on 5 CPUs!

- Unstable and unpredictable latency introduced by CXL
- µs-scale tail latency even when bandwidth is not saturated

#### Extensive CXL characterization across diverse workloads

Quantitative slowdowns due to latency or bandwidth boundness

#### SPA: A simple and accurate performance analysis approach

- 9 CPU counters for accurate slowdown estimation (>95% accuracy for over 95% of workloads)
- Dissect the root causes of CXL slowdown
- Disclose CPU prefetching inefficiency



#### Melody overview

#### CXL tail latency

#### Workload characterization

SPA: <u>Stall-based CXL performance analysis</u>




### **CXL** Latency Variation






Average latency is not enough to capture CXL performance variations

Some CXL devices exhibit unstable latency compared to regular DRAM
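
The gap between average and tail latency can be illustrated with a small calculation. The following is a minimal sketch, assuming a list of per-access latency samples in nanoseconds is already available; the function names and the sample values are hypothetical and are not taken from Melody.

```python
import math


def percentile_nearest_rank(sorted_samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    rank = max(1, math.ceil(len(sorted_samples) * p / 100))
    return sorted_samples[rank - 1]


def latency_summary(samples_ns):
    """Summarize latency samples (ns) by mean, median, and 99th percentile."""
    if not samples_ns:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ns)
    return {
        "mean": sum(ordered) / len(ordered),
        "p50": percentile_nearest_rank(ordered, 50),
        "p99": percentile_nearest_rank(ordered, 99),
    }


# Hypothetical samples: 98 accesses at 200 ns plus 2 outliers at 2000 ns.
samples = [200] * 98 + [2000] * 2
summary = latency_summary(samples)
print(summary)
```

In this made-up example the mean is 236 ns and the median is 200 ns, while the 99th percentile is 2000 ns, so a report based only on the average would hide the µs-scale tail.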

## Tail Latency across CXL Devices






#### Some CXL devices have lower tail latency (CXL-A, CXL-D)

## CXL Tail Latency in Workloads



CXL tail latency can lead to unpredictable application performance




- Slowdown = $(Time_{CXL} / Time_{DRAM} - 1) \times 100\%$
- Workload categories:
  - SPEC CPU 2017
  - PARSEC
  - Graph (GAPBS, PBBS)
  - Database (Redis, VoltDB)
  - ML/AI (GPT-2, Llama, MLPerf)
  - Data analytics (Spark)
  - Phoronix
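
The slowdown metric defined at the top of this list can be computed directly from two runtime measurements of the same workload. A minimal sketch, assuming runtimes in seconds; the function name and the example values are illustrative only.

```python
def slowdown_percent(time_cxl: float, time_dram: float) -> float:
    """Slowdown = (Time_CXL / Time_DRAM - 1) * 100%."""
    if time_dram <= 0:
        raise ValueError("time_dram must be positive")
    return (time_cxl / time_dram - 1.0) * 100.0


# Hypothetical example: 113 s on CXL memory vs. 100 s on local DRAM.
example = slowdown_percent(113.0, 100.0)
print(f"slowdown: {example:.1f}%")
```

A value of 0% means the workload runs equally fast on CXL and DRAM; positive values mean the CXL run took longer.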






- NUMA (212ns, 119GB/s)
- CXL-A (214ns, 24GB/s)
- CXL-D (239ns, 52GB/s)

Higher CXL bandwidth (24GB/s $\rightarrow$ 52GB/s) partially mitigates slowdown tails

CXL~NUMA: The performance gap between (high-bandwidth) CXL and NUMA is closing!




#### How does CXL latency affect CPU pipeline efficiency?

Slowdown (S) = $\Delta Cycles / Cycles_{DRAM}$

$$\begin{aligned}
\Delta Cycles &= Cycles_{CXL} - Cycles_{DRAM} \\
&\approx \Delta Cycles_{Backend} \\
&\approx \Delta Stalls_{DRAM} + \Delta Stalls_{Cache} + \Delta Stalls_{Store}
\end{aligned}$$
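
The decomposition above can be turned into a small calculation: the estimated slowdown is the sum of the stall deltas between the CXL run and the DRAM run, divided by the DRAM cycle count. The following is a minimal sketch under stated assumptions. The field names are hypothetical placeholders for the three stall terms and do not correspond to the actual counters Melody collects, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class RunCounters:
    """Counter totals for one run. Field names are hypothetical placeholders."""
    cycles: int
    stalls_dram: int   # the Stalls_DRAM term
    stalls_cache: int  # the Stalls_Cache term
    stalls_store: int  # the Stalls_Store term


def estimate_slowdown(dram: RunCounters, cxl: RunCounters) -> dict:
    """Estimate S as (dStalls_DRAM + dStalls_Cache + dStalls_Store) / Cycles_DRAM."""
    if dram.cycles <= 0:
        raise ValueError("DRAM cycle count must be positive")
    parts = {
        "dram": (cxl.stalls_dram - dram.stalls_dram) / dram.cycles,
        "cache": (cxl.stalls_cache - dram.stalls_cache) / dram.cycles,
        "store": (cxl.stalls_store - dram.stalls_store) / dram.cycles,
    }
    parts["estimated"] = parts["dram"] + parts["cache"] + parts["store"]
    # Measured slowdown from total cycles, for comparison with the estimate.
    parts["measured"] = (cxl.cycles - dram.cycles) / dram.cycles
    return parts


# Made-up counter values for illustration only.
dram_run = RunCounters(cycles=1000, stalls_dram=200, stalls_cache=100, stalls_store=50)
cxl_run = RunCounters(cycles=1130, stalls_dram=290, stalls_cache=125, stalls_store=60)
result = estimate_slowdown(dram_run, cxl_run)
print(result)
```

Keeping the three terms separate, rather than returning only their sum, is what lets the per-term shares be read as a breakdown of where the slowdown comes from.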



## CXL Slowdown Breakdown of Real Applications




## The Sources of Slowdown Vary across Workloads



Redis, VoltDB, GPT-2: Slowdown comes mainly from demand reads

GAPBS, Llama: Part of the slowdown is caused by prefetching inefficiency

SPEC CPU 2017: Diverse slowdown from demand reads, prefetching, and stores

## SPA for Performance Debugging & Optimization




Slowdown will be reduced from 13% to 2%

## More in the Paper!

#### CXL tail latency: analysis and reasoning

Factors for tail latency

#### Slowdown analysis

- Large-scale experimental verification for SPA
- Period-based slowdown analysis

#### SPA use cases and implications

Performance debugging, tuning, and prediction



https://github.com/MoatLab/Melody

#### Thank you! Questions?

#### Systematic CXL Memory Characterization and Performance Analysis at Scale

Jinshu Liu, Hamid Hadian, Yuyue Wang (Virginia Tech, USA); Daniel S. Berger (Microsoft and University of Washington, USA); Marie Nguyen (Samsung, USA); Xun Jian, Sam H. Noh, Huaicheng Li (Virginia Tech, USA)

#### Abstract

Compute Express Link (CXL) has emerged as a pivotal interconnect technology for enabling scalable memory expansion. Despite its potential, the performance implications of CXL across diverse devices, latency regimes, processor architectures, and workloads remain underexplored. In this paper, we present Melody, a comprehensive framework for systematic characterization and analysis of CXL memory performance. Melody leverages an extensive evaluation spanning 265 workloads, 4 real CXL devices, 7 latency levels, and 5 CPU platforms. Melody yields many key insights: workload sensitivity to sub-microsecond CXL latencies (140-410ns), the first disclosure and quantification of CXL-induced tail latency and its impact, CPU tolerance to CXL latencies, a novel stall-based root-cause analysis approach (SPA) for pinpointing CXL bottlenecks, and the identification of CPU prefetcher inefficiencies under CXL.

CCS Concepts: • Hardware → Emerging technologies; • Computer systems organization → Architectures.

Keywords: Compute Express Link, CXL, Memory, Profiling

#### ACM Reference Format:

Jinshu Liu, Hamid Hadian, Yuyue Wang, Daniel S. Berger, Marie Nguyen, Xun Jian, Sam H. Noh, and Huaicheng Li. 2025. Systematic CXL Memory Characterization and Performance Analysis at Scale. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS '25), March 30–April 3, 2025, Rotterdam, Netherlands. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3676641.3715987

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

ASPLOS '25, Rotterdam, Netherlands.

© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.

ACM ISBN 979-8-4007-1079-7/25/03

https://doi.org/10.1145/3676641.3715987

Figure 1. The spectrum of sub-µs CXL latency and bandwidth.

#### 1 Introduction

Driven by the growing requirements of memory-intensive applications, the demand for increased memory capacity is rapidly rising [37]. The surge is further compounded by DRAM scaling challenges [41]. Emerging interconnects like Compute Express Link (CXL) hold the promise of both scale-up and scale-out memory expansion at the server/rack levels [34, 36, 45]. Various memory vendors have introduced CXL memory expanders [5, 4, 8, 15], some of which are being deployed in production systems, facilitating access to significantly larger amounts of DRAM than previously feasible.

Low memory access latency is key to system performance, but CXL memory expansion introduces higher latencies than traditional socket-local DRAM [27, 34, 42]. Figure 1 illustrates the substantial heterogeneity in CXL latency and bandwidth, as measured across 4 CXL devices within our platform (Table 1) and 2 more data points from public sources<sup>1</sup> [15, 17]. Furthermore, CXL devices can exhibit varying performance characteristics: the variability in latency and bandwidth arises from differing interconnect topologies and vendor optimizations [27, 42]. For instance, the latency of locally attached CXL memory ranges from ~200ns to ~400ns, slightly exceeding NUMA latency. Accessing CXL memory from a remote socket results in increased latency and diminished bandwidth (CXL+NUMA). Using CXL switch(es) to extend connectivity introduces additional latency (CXL+Switch), raising it to approximately 600ns.

The current CPU architecture and memory hierarchy are tailored for typical multi-socket systems, offering ~100ns latency and 100s of GB/s bandwidth. However, the performance implications of CXL memory with sub-µs latencies remain

<sup>1</sup>CXL+Switch data is from [15], and bandwidth is averaged for 1 CXL device.