How to use CUPTI APIs to build a CUDA profiler?
Introduction
Besides the well-known Nsight Systems (nsys) and NVIDIA Nsight Compute (ncu), NVIDIA also provides a set of profiling APIs called the CUDA Profiling Tools Interface (CUPTI). When we want to build a custom profiler for a special purpose or to trace specific events, CUPTI is a good choice. In this post, I will share my experiences and lessons learned from building a CUPTI-based CUDA profiler.
Useful CUPTI APIs
CUPTI provides a set of APIs, which can be divided into two categories: tracing APIs and profiling APIs. Tracing APIs are used to trace the execution of CUDA kernels at the function level, while profiling APIs are used to collect performance metrics inside the kernels.
| Kind | Category | Description |
| --- | --- | --- |
| Activity | Tracing | Asynchronously record CUDA activities, e.g., CUDA API calls, kernel executions, memory copies |
| Callback | Tracing | Callback mechanism that notifies a subscriber when a specific CUDA event executed |
| Host Profiling | Profiling | Host APIs for enumeration, configuration, and evaluation of performance metrics |
| Range Profiling | Profiling | Target APIs for collection of performance metrics for a range of execution |
| PC Sampling | Profiling | Sampling of the warp program counter and warp scheduler state (stall reasons) |
| SASS Metrics | Profiling | Collect kernel performance metrics at the source level using SASS patching |
| PM Sampling | Profiling | Collect hardware metrics by sampling the GPU performance monitors (PM) periodically |
| Profiling | Profiling | Target APIs for collection of performance metrics for a range of execution; host operations (enumeration, configuration, and evaluation of metrics) are supported by the Perfworks Metrics API |
| Event | Profiling | Collect kernel performance counters for a kernel execution |
| Metric | Profiling | Collect kernel performance metrics for a kernel execution |
| Checkpoint | - | Automatically save and restore the functional state of the CUDA device |
Tracing APIs
Tracing CUDA functions can be achieved in two ways: asynchronous activity recording and synchronous activity interception, which correspond to the Activity APIs and the Callback APIs, respectively.
Activity APIs
The Activity APIs provide an interface for asynchronously recording a log of every CUDA activity on the GPU.
At the beginning of a program, developers should call

```cpp
CUptiResult CUPTIAPI cuptiSubscribe(CUpti_SubscriberHandle *subscriber,
                                    CUpti_CallbackFunc callback,
                                    void *userdata);
```

to initialize the CUPTI monitor thread, where `callback` is a user-defined function that specifies the behavior when a callback condition is satisfied (the condition itself is selected via `cuptiEnableCallback`; in the official example, it is set to CUDA device reset or CUDA fatal errors).
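All of these APIs return a `CUptiResult`, so the snippets in this post wrap calls in a `CUPTI_API_CALL` error-checking macro. The exact macro shipped with the CUPTI samples differs slightly; a minimal sketch looks like this:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cupti.h>

// Minimal CUPTI error-checking macro (modeled on the helper used in the CUPTI samples).
#define CUPTI_API_CALL(call)                                                  \
    do {                                                                      \
        CUptiResult _status = (call);                                         \
        if (_status != CUPTI_SUCCESS) {                                       \
            const char *errstr = "<unknown>";                                 \
            cuptiGetResultString(_status, &errstr);                           \
            fprintf(stderr, "%s:%d: error: %s failed with %s.\n",             \
                    __FILE__, __LINE__, #call, errstr);                       \
            exit(EXIT_FAILURE);                                               \
        }                                                                     \
    } while (0)
```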
Developers should also call
```cpp
CUptiResult CUPTIAPI cuptiActivityRegisterCallbacks(
    CUpti_BuffersCallbackRequestFunc funcBufferRequested,
    CUpti_BuffersCallbackCompleteFunc funcBufferCompleted);
```
for buffer management. `funcBufferRequested` is called whenever CUPTI needs a new, empty buffer to store activity records, e.g., when the current one fills up. `funcBufferCompleted` is called when a buffer of completed records is handed back to the client; this is usually where the records are read out and written in a user-defined format (see the sketch below).
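To make this concrete, here is a minimal sketch of the two buffer callbacks; the buffer size and the record kind handled are arbitrary choices for illustration, and the registration line assumes the `CUPTI_API_CALL` macro from above:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cupti.h>

static const size_t BUF_SIZE = 8 * 1024 * 1024;  // 8 MiB per activity buffer (arbitrary choice)

// Called whenever CUPTI needs a new, empty buffer for activity records.
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size, size_t *maxNumRecords)
{
    *size = BUF_SIZE;
    *buffer = (uint8_t *)malloc(BUF_SIZE);
    *maxNumRecords = 0;  // 0 = store as many records as fit in the buffer
}

// Called when a buffer is full (or flushed); iterate over the completed records.
static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size, size_t validSize)
{
    CUpti_Activity *record = NULL;
    while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS) {
        if (record->kind == CUPTI_ACTIVITY_KIND_RUNTIME) {
            // CUpti_ActivityAPI covers RUNTIME (and DRIVER) records; kernel records use the
            // versioned CUpti_ActivityKernel* struct matching your CUPTI release.
            const CUpti_ActivityAPI *api = (const CUpti_ActivityAPI *)record;
            printf("runtime cbid %u took %llu ns\n", api->cbid,
                   (unsigned long long)(api->end - api->start));
        }
    }
    free(buffer);
}

// Registration, using the macro from above:
// CUPTI_API_CALL(cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted));
```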
After this basic initialization, developers can call

```cpp
CUptiResult CUPTIAPI cuptiActivityEnable(CUpti_ActivityKind kind);
```

to enable tracing of particular activity kinds. The full list can be found in the official documentation; there are 55 kinds of activities in total, and here we introduce the most useful ones (a minimal setup sketch follows the list).
- CUPTI_ACTIVITY_KIND_MEMCPY: memory copy
  - [Warning] `cudaMemcpyAsync` is not included.
- CUPTI_ACTIVITY_KIND_KERNEL: kernel execution
  - [Warning] Enabling this activity will cause all kernels to execute sequentially during program execution, which may break the original semantics. (To be checked.)
- CUPTI_ACTIVITY_KIND_RUNTIME: runtime functions
  - [Warning] `cudaMemcpyAsync` is categorized into runtime functions instead of memory copy operations. God knows why.
  - [Note] Including but not limited to `cudaLaunchKernel`, `cudaMalloc`, and `cudaFree`.
- CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL: kernel execution
  - This doesn't break the execution order/parallelism of different kernels.
- CUPTI_ACTIVITY_KIND_SOURCE_LOCATOR
  - [Note] To be looked into.
- CUPTI_ACTIVITY_KIND_OVERHEAD: profiling overhead
  - [Note] May be useful for evaluation.
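Putting the pieces together, a minimal activity-tracing setup might look like the sketch below. The helper names `initActivityTracing`/`finalizeActivityTracing` are my own, the enabled kinds are just an example, and it assumes the `bufferRequested`/`bufferCompleted` callbacks and `CUPTI_API_CALL` macro sketched earlier:

```cpp
#include <cupti.h>

// Call once before the workload: register buffer callbacks and enable activity kinds.
void initActivityTracing()
{
    CUPTI_API_CALL(cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted));
    // CONCURRENT_KERNEL preserves kernel concurrency, unlike KERNEL (see the warnings above).
    CUPTI_API_CALL(cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL));
    CUPTI_API_CALL(cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY));
    CUPTI_API_CALL(cuptiActivityEnable(CUPTI_ACTIVITY_KIND_RUNTIME));
}

// Call once at the end (e.g., via atexit) so bufferCompleted sees any still-buffered records.
void finalizeActivityTracing()
{
    CUPTI_API_CALL(cuptiActivityFlushAll(CUPTI_ACTIVITY_FLAG_FLUSH_FORCED));
}
```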
Callback APIs
The callback APIs allow developers to register hooks during the execution of CUDA programs.
At the beginning of the execution, developers should manually register hooks together with their triggering conditions, for example:
```cpp
// One subscriber is used to register multiple callback domains
CUpti_SubscriberHandle subscriber;
CUPTI_API_CALL(cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)ProfilerCallbackHandler, NULL));
// Driver callback domain is needed for kernel launch callbacks
CUPTI_API_CALL(cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_DRIVER_API, CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel));
// Resource callback domain is needed for context creation callbacks
CUPTI_API_CALL(cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_RESOURCE, CUPTI_CBID_RESOURCE_CONTEXT_CREATED));
```
The hook function should be defined as
```cpp
void CUPTIAPI
ProfilerCallbackHandler(void *pUserData,
                        CUpti_CallbackDomain domain,
                        CUpti_CallbackId callbackId,
                        void const *pCallbackData);
```
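For instance, a handler matching the registration above could log each `cuLaunchKernel` at entry and exit and note context creation. This is only an illustrative sketch, not the official sample's handler:

```cpp
#include <cstdio>
#include <cupti.h>

void CUPTIAPI
ProfilerCallbackHandler(void *pUserData,
                        CUpti_CallbackDomain domain,
                        CUpti_CallbackId callbackId,
                        void const *pCallbackData)
{
    if (domain == CUPTI_CB_DOMAIN_DRIVER_API &&
        callbackId == CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel) {
        // For API domains, the payload is a CUpti_CallbackData describing the call.
        const CUpti_CallbackData *cbInfo = (const CUpti_CallbackData *)pCallbackData;
        if (cbInfo->callbackSite == CUPTI_API_ENTER) {
            // Invoked just before the driver executes cuLaunchKernel.
            printf("enter %s, kernel symbol %s\n", cbInfo->functionName, cbInfo->symbolName);
        } else if (cbInfo->callbackSite == CUPTI_API_EXIT) {
            // Invoked right after cuLaunchKernel returns.
            printf("exit  %s\n", cbInfo->functionName);
        }
    } else if (domain == CUPTI_CB_DOMAIN_RESOURCE &&
               callbackId == CUPTI_CBID_RESOURCE_CONTEXT_CREATED) {
        printf("a CUDA context was created\n");
    }
}
```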