How to use CUPTI APIs to build a CUDA profiler?

Introduction

Besides the well-known Nsight Systems (nsys) and NVIDIA Nsight Compute (ncu), NVIDIA also provides a set of profiling APIs called the CUDA Profiling Tools Interface (CUPTI). When we want to build a custom profiler for special purposes, or to trace specific events, CUPTI is a good choice. In this post, I will share my experiences and lessons learned from building a CUPTI-based CUDA profiler.

Useful CUPTI APIs

CUPTI provides a set of APIs that can be divided into two categories: tracing APIs and profiling APIs. Tracing APIs record the execution of CUDA activities (API calls, kernel launches, memory copies) at the function level, while profiling APIs collect performance metrics from inside the kernels.

The API kinds, with their category in parentheses:

  • Activity (Tracing): asynchronously record CUDA activities, e.g., CUDA API calls, kernel executions, and memory copies
  • Callback (Tracing): a callback mechanism that notifies a subscriber when a specific CUDA event has executed
  • Host Profiling (Profiling): host APIs for enumeration, configuration, and evaluation of performance metrics
  • Range Profiling (Profiling): target APIs for collecting performance metrics over a range of execution
  • PC Sampling (Profiling): sampling of the warp program counter and warp scheduler state (stall reasons)
  • SASS Metrics (Profiling): collect kernel performance metrics at the source level using SASS patching
  • PM Sampling (Profiling): collect hardware metrics by periodically sampling the GPU performance monitors (PM)
  • Profiling (Profiling): target APIs for collecting performance metrics over a range of execution; host operations (enumeration, configuration, and evaluation of metrics) are supported by the Perfworks Metrics API
  • Event (Profiling): collect kernel performance counters for a kernel execution
  • Metric (Profiling): collect kernel performance metrics for a kernel execution
  • Checkpoint: automatically save and restore the functional state of the CUDA device

Tracing APIs

Tracing of CUDA functions can be achieved in two ways: asynchronous activity recording and synchronous activity interception, which correspond to the activity APIs and the callback APIs, respectively.

Activity APIs

The activity APIs provide an interface for asynchronously recording a log of every CUDA activity on the GPU.

At the beginning of a program, developers should call

CUptiResult CUPTIAPI cuptiSubscribe(CUpti_SubscriberHandle *subscriber,
                                    CUpti_CallbackFunc callback,
                                    void *userdata);

to initialize the CUPTI monitor thread, where callback is a user-defined function that specifies what to do when a subscribed callback condition is satisfied (the conditions themselves are selected with cuptiEnableCallback; in the official example, they are CUDA device reset and CUDA fatal errors).
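
As a concrete illustration, here is a minimal sketch of such a callback that flushes buffered activity records when the application resets the device. The function name is mine, and the exact runtime CBID constant should be verified against the cupti_runtime_cbid.h header of your CUPTI version.

#include <cupti.h>

// Sketch of a subscriber callback (illustrative name): when cudaDeviceReset()
// is entered, force CUPTI to hand over any activity records still buffered.
static void CUPTIAPI
FlushOnResetHandler(void *userdata, CUpti_CallbackDomain domain,
                    CUpti_CallbackId cbid, const void *cbdata)
{
    const CUpti_CallbackData *cbInfo = (const CUpti_CallbackData *)cbdata;

    if (domain == CUPTI_CB_DOMAIN_RUNTIME_API &&
        cbid == CUPTI_RUNTIME_TRACE_CBID_cudaDeviceReset_v3020 &&
        cbInfo->callbackSite == CUPTI_API_ENTER) {
        cuptiActivityFlushAll(0);
    }
}

It would be passed to cuptiSubscribe as the callback argument and enabled for the runtime domain with cuptiEnableCallback.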

Developers should also call

CUptiResult CUPTIAPI cuptiActivityRegisterCallbacks(
    CUpti_BuffersCallbackRequestFunc funcBufferRequested,
    CUpti_BuffersCallbackCompleteFunc funcBufferCompleted);

for buffer management. funcBufferRequested is called whenever CUPTI needs an empty buffer to store activity records, and funcBufferCompleted is called when a buffer is full (or flushed) and its records are ready to be consumed. The latter is usually where the records are output in a user-defined way.
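
A minimal sketch of the two callbacks, following the pattern used in the official samples (the names BufferRequested/BufferCompleted, the buffer size, and the printf output are mine):

#include <cupti.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE (8 * 1024 * 1024)

// Called by CUPTI whenever it needs an empty buffer for activity records.
static void CUPTIAPI
BufferRequested(uint8_t **buffer, size_t *size, size_t *maxNumRecords)
{
    *size = BUF_SIZE;
    *buffer = (uint8_t *)malloc(BUF_SIZE);
    *maxNumRecords = 0;  // 0 = fit as many records as the buffer allows
}

// Called by CUPTI when a buffer is full (or flushed) and ready to be consumed.
static void CUPTIAPI
BufferCompleted(CUcontext ctx, uint32_t streamId,
                uint8_t *buffer, size_t size, size_t validSize)
{
    CUpti_Activity *record = NULL;
    // Iterate over the records; a real profiler would switch on record->kind
    // and cast to the matching CUpti_Activity* struct.
    while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS) {
        printf("activity kind = %d\n", (int)record->kind);
    }
    free(buffer);
}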

After basic initialization, developers can use

CUptiResult CUPTIAPI cuptiActivityEnable(CUpti_ActivityKind kind);

to enable tracing of certain activity kinds. The full list of activities can be found in the official documentation; there are 55 kinds in total, and the most useful ones are introduced below. A minimal end-to-end setup is sketched right after the list.

  • CUPTI_ACTIVITY_KIND_MEMCPY: memory copy
    • [Warning] cudaMemcpyAsync is not included.
  • CUPTI_ACTIVITY_KIND_KERNEL: kernel execution
    • [Warning] Enabling this activity causes all kernels to execute sequentially, which may break the original semantics of the program.
    • To be checked.
  • CUPTI_ACTIVITY_KIND_RUNTIME: runtime functions
    • [Warning] cudaMemcpyAsync is categorized as a runtime function instead of a memory copy operation. God knows why.
    • [Note] Including but not limited to cudaLaunchKernel, cudaMalloc, and cudaFree.
  • CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL: kernel execution
    • This doesn't break the execution order/parallelism of different kernels.
  • CUPTI_ACTIVITY_KIND_SOURCE_LOCATOR
    • [Note] To be looked into.
  • CUPTI_ACTIVITY_KIND_OVERHEAD: profiling overhead
    • [Note] May be useful for evaluation.
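
Putting the pieces together, a minimal activity-tracing setup might look like the sketch below. It assumes the BufferRequested/BufferCompleted callbacks sketched earlier; InitTracing and FinishTracing are illustrative names, and error checking is elided.

#include <cupti.h>

void InitTracing(void)
{
    // Enable the activity kinds of interest before any CUDA work happens.
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_RUNTIME);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY);
    // CONCURRENT_KERNEL rather than KERNEL, to avoid serializing kernel execution.
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_OVERHEAD);

    // Register the buffer-management callbacks.
    cuptiActivityRegisterCallbacks(BufferRequested, BufferCompleted);
}

void FinishTracing(void)
{
    // Force delivery of all remaining records to BufferCompleted before exit.
    cuptiActivityFlushAll(CUPTI_ACTIVITY_FLAG_FLUSH_FORCED);
}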

Callback API

The callback APIs allow developers to register hooks that are invoked synchronously during the execution of a CUDA program.

At the beginning of the execution, developers should manually register the hooks together with their invoking conditions, for example:

// One subscriber is used to register multiple callback domains
CUpti_SubscriberHandle subscriber;
CUPTI_API_CALL(cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)ProfilerCallbackHandler, NULL));
// Driver API callback domain is needed for kernel launch callbacks (cuLaunchKernel)
CUPTI_API_CALL(cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_DRIVER_API, CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel));
// Resource callback domain is needed for context creation callbacks
CUPTI_API_CALL(cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_RESOURCE, CUPTI_CBID_RESOURCE_CONTEXT_CREATED));

The hook function should be defined as

void
ProfilerCallbackHandler(
    void *pUserData,
    CUpti_CallbackDomain domain,
    CUpti_CallbackId callbackId,
    void const *pCallbackData);
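
Inside the handler, domain and callbackId identify which event fired, and pCallbackData must be cast to the structure that matches the domain. Below is a rough sketch of a body handling only the two callbacks enabled above; the printf output is just for illustration.

#include <cupti.h>
#include <stdio.h>

void
ProfilerCallbackHandler(
    void *pUserData,
    CUpti_CallbackDomain domain,
    CUpti_CallbackId callbackId,
    void const *pCallbackData)
{
    if (domain == CUPTI_CB_DOMAIN_DRIVER_API &&
        callbackId == CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel) {
        // For driver/runtime API domains, the payload is a CUpti_CallbackData.
        const CUpti_CallbackData *cbInfo = (const CUpti_CallbackData *)pCallbackData;
        if (cbInfo->callbackSite == CUPTI_API_ENTER) {
            // Runs on the launching CPU thread as it enters cuLaunchKernel.
            printf("launching kernel %s (correlation id %u)\n",
                   cbInfo->symbolName, cbInfo->correlationId);
        }
    } else if (domain == CUPTI_CB_DOMAIN_RESOURCE &&
               callbackId == CUPTI_CBID_RESOURCE_CONTEXT_CREATED) {
        // For the resource domain, the payload is a CUpti_ResourceData.
        const CUpti_ResourceData *resData = (const CUpti_ResourceData *)pCallbackData;
        printf("CUDA context created: %p\n", (void *)resData->context);
    }
}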