Characterizing the energy consumption of CUDA kernels

Energy consumption has become a critical concern in machine learning systems. Recent works (Zeus [NSDI'23], EnvPipe [ATC'23], μ-Serve [ATC'24], Perseus [SOSP'24], and DynamoLLM [HPCA'25]) have shown that the energy consumption of machine learning systems can be reduced by tuning GPU frequency, batch size, and other knobs. This post analyzes the energy consumption of common operators (i.e., CUDA kernels) used in machine learning systems.

Motivation

We want to understand how the energy consumption of CUDA kernels varies with SM utilization, memory access patterns, CUDA core versus tensor core usage, and other factors. To that end, we profile the energy consumption of GEMM and matrix addition operators across different shapes and GPUs. We also examine the collected data to characterize the behavior of common energy profiling tools.

Methodology

Following state-of-the-art works, we use the Zeus framework with the NVIDIA NVML library to collect the power consumption of the GPU. This method has known limitations: (1) a low sampling rate, (2) measurement latency, and (3) no kernel-level (fine-grained) power data. To mitigate these limitations, we replay each operator many times at the Python level and divide the total energy consumption by the number of executions. Furthermore, we repeat each measurement for multiple iterations and report the mean and standard deviation of the energy consumption.
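The replay-and-average scheme can be sketched as follows. This is a minimal illustration, not the actual profiling harness: `read_energy_mj` stands in for a cumulative GPU energy counter (e.g., what Zeus reads via NVML), and the simulated operator and counter below are hypothetical stand-ins so the logic can run anywhere.

```python
# Sketch of the replay-and-average measurement loop described above.
# Assumption: `read_energy_mj` returns a monotonically increasing
# cumulative energy reading in millijoules (a stand-in for the real
# NVML-backed counter used by Zeus).
from statistics import mean, stdev

def energy_per_call(run_op, read_energy_mj, replays=1000, iterations=5):
    """Replay `run_op` many times per iteration and divide the energy
    delta by the replay count, amortizing sampling-rate and latency
    error; repeat for several iterations to get mean and std dev."""
    samples = []
    for _ in range(iterations):
        start = read_energy_mj()
        for _ in range(replays):
            run_op()
        samples.append((read_energy_mj() - start) / replays)
    return mean(samples), stdev(samples)

# Simulated operator and counter: each call "draws" 2 mJ,
# so the estimate should recover exactly 2.0 mJ per call.
state = {"mj": 0.0}
def fake_op():
    state["mj"] += 2.0
def fake_counter():
    return state["mj"]

avg, sd = energy_per_call(fake_op, fake_counter, replays=100, iterations=5)
print(avg, sd)  # → 2.0 0.0
```

The division by `replays` is what recovers per-execution energy from a counter that is too coarse to resolve a single kernel launch.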

Results

GEMM operator

We profile the energy consumption of the GEMM operator for a range of matrix shapes on an NVIDIA RTX 4070 Ti Super GPU.
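To compare energy across shapes fairly, it helps to normalize by the work each shape performs. A small sketch of that bookkeeping, with illustrative square shapes (the actual shapes swept in the experiments are not specified here): FLOP count and minimum memory traffic give the arithmetic intensity, which suggests whether a shape is compute- or memory-bound.

```python
# Hypothetical normalization helpers for the GEMM shape sweep.
def gemm_flops(m, n, k):
    """FLOPs for C[m,n] = A[m,k] @ B[k,n]: m*n*k multiply-adds,
    counted as 2 FLOPs each."""
    return 2 * m * n * k

def gemm_bytes(m, n, k, dtype_size=2):
    """Minimum DRAM traffic assuming fp16 operands (2 bytes each):
    read A and B once, write C once."""
    return dtype_size * (m * k + k * n + m * n)

for size in (1024, 2048, 4096):
    f = gemm_flops(size, size, size)
    b = gemm_bytes(size, size, size)
    print(f"{size}: {f / 1e9:.1f} GFLOP, {f / b:.0f} FLOP/byte")
```

Dividing the measured per-call energy by `gemm_flops` yields energy per FLOP, a shape-independent metric for comparing GEMM configurations.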