How to trace an ML application?
Not finished yet
A Chinese version can be found here.
Recently I have been building a powerful profiler for ML applications, which requires tracing the execution of CUDA kernels and reasoning about the dependencies, control flow, and data flow between them. Along the way I have tried and failed with many tools, so let me share my experience with you.
Goal
To avoid confusion, let me first clarify what I mean by “tracing”. In short, it means extracting the execution flow of a machine learning model. Funnily enough, when I discussed this goal with two friends (one ML expert and one GPU expert), they immediately responded with (1) “what? I know the architecture of a Transformer model!” and (2) “what? I know how to use Nsight Systems!” respectively. It was a pity that no MLSys expert was around, or I would have received a third response: (3) “what? I know how to use PyTorch Profiler!” What I actually want to do is combine all three responses automatically: given an ML application, I want to convert its execution into a directed acyclic graph (DAG) with (1) CUDA kernels as nodes, (2) data/tensor dependencies as edges, and (3) control flow as attributes.
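To make this concrete, here is a minimal sketch of what such a trace DAG might look like as a data structure. All names here (`KernelNode`, `TraceDAG`, the field choices) are hypothetical illustrations, not part of any existing tool:

```python
from dataclasses import dataclass, field

@dataclass
class KernelNode:
    """One CUDA kernel launch in the trace (hypothetical schema)."""
    kernel_name: str  # e.g. a layer-norm kernel symbol
    module: str       # owning model module, e.g. "transformer.h.0.ln_1"
    attrs: dict = field(default_factory=dict)  # control-flow context (loop/branch info)

@dataclass
class TraceDAG:
    """Kernels as nodes, tensor dependencies as directed edges."""
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (producer_idx, consumer_idx, tensor_id)

    def add_kernel(self, node: KernelNode) -> int:
        self.nodes.append(node)
        return len(self.nodes) - 1  # node index used as an edge endpoint

    def add_dependency(self, src: int, dst: int, tensor_id: int) -> None:
        # tensor produced by nodes[src] is consumed by nodes[dst]
        self.edges.append((src, dst, tensor_id))
```

A single layer-norm kernel feeding an attention matmul would then appear as two nodes connected by one edge carrying the tensor's identifier.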
For example, this is the trace I want to extract from a simple single-layer GPT model:
module=transformer.h.0.ln_1, kernel=void at::native::(anonymous namespace)::vectorized_layer_norm_kernel<c10::Half, float>,