Making PyTorch Static: A Step-by-Step Walkthrough
Chinese version can be found here
As you may know, I’ve applied programming-language techniques to analyze ML systems. I recently built a statically linked version of PyTorch to simplify static analysis. Here’s my approach and the steps I took.
Why Static Linking?
Static analysis in programming-language (PL) research typically works on LLVM IR, and the same applies to analyzing ML applications. However, most ML frameworks are built with dynamic linking. For example, when we compile the following program:
#include <torch/torch.h>

int main() {
  auto x = torch::randn({4, 4});
  auto y = torch::randn({4, 4});
  auto z = torch::matmul(x, y);
}
the executable dynamically loads libtorch.so, which in turn depends on libraries like libc10.so, libtorch_cpu.so, and libtorch_cuda.so. This setup scatters code across modules, which makes cross-module static analysis challenging and hinders static analysis tools, not to mention dynamic approaches such as symbolic execution.
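You can observe this chain directly. As a quick check (assuming the executable is built as code; the exact list varies by build):
# Shared libraries the dynamically linked executable loads at startup.
ldd ./code

# Symbols left undefined in the binary, to be resolved by the dynamic loader.
nm -D --undefined-only ./code | head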
By contrast, static linking combines all dependencies and application code into a single binary. This unified artifact simplifies analysis, allowing LLVM’s static analysis tools to operate on the complete bitcode without external dependencies.
Experience
If you just need a quick reference, you can jump to the Steps section.
Before attempting a static build, I reviewed the default dynamic build process. Typically, you would add
add_executable(code playground/test.cpp)
target_link_libraries(code PRIVATE torch)
to the CMakeLists.txt file. This automatically links the whole PyTorch library, including all of its dependencies.
Building Main Libraries
Next, I attempted to build a static version by modifying the CMakeLists.txt. PyTorch provides a BUILD_SHARED_LIBS option, so I set it to OFF and recompiled, but quickly ran into issues. The first errors came from sublibraries that do not support static linking: libtorch_python.so, for example, must be linked against the Python interpreter library, and similar errors occurred for caffe2 and other components. Luckily, I found a blog post (in Chinese) that tweaks the parameters the PyTorch team uses to build a partially static version (I will explain why it is only partial later). So I copied the command and ran it:
cmake \
-DCMAKE_VERBOSE_MAKEFILE:BOOL=1 \
-DUSE_CUDA=ON \
-DBUILD_CAFFE2=OFF \
-DBUILD_CAFFE2_OPS=OFF \
-DUSE_DISTRIBUTED=OFF \
-DBUILD_TEST=OFF \
-DBUILD_BINARY=OFF \
-DBUILD_MOBILE_BENCHMARK=0 \
-DBUILD_MOBILE_TEST=0 \
-DUSE_ROCM=OFF \
-DUSE_GLOO=OFF \
-DUSE_LEVELDB=OFF \
-DUSE_MPI:BOOL=OFF \
-DBUILD_PYTHON:BOOL=OFF \
-DBUILD_CUSTOM_PROTOBUF:BOOL=OFF \
-DUSE_OPENMP:BOOL=OFF \
-DBUILD_SHARED_LIBS:BOOL=OFF \
-DCMAKE_BUILD_TYPE:STRING=Release \
-DPYTHON_EXECUTABLE:PATH=`which python3` \
-DCMAKE_INSTALL_PREFIX:PATH=../libtorch_cuda \
../..
The libtorch.a and other main libraries were generated successfully!
Static Dependencies for Binaries
However, my own executable failed to link (recall that we disabled BUILD_BINARY, so PyTorch's bundled binaries were skipped; the failures came from my test program). Although linking against libtorch suffices for dynamic linking, static linking requires explicitly listing every dependency. Let me show you an example. In PyTorch's CMake files, we have
target_link_libraries(torch_cpu PRIVATE caffe2::mkl)
and
target_link_libraries(torch PUBLIC torch_cpu_library)
so when we use target_link_libraries(code PRIVATE torch), it only links libtorch_cpu (indirectly, through libtorch) but not mkl. For dynamic linking this is fine: code never calls mkl functions directly, so the linker does not complain, and the dynamic loader resolves mkl at runtime as a dependency of libtorch_cpu. For static linking, however, code absorbs the whole libtorch_cpu.a into its own binary; the binary then contains references to mkl functions without being linked against mkl, which results in an “undefined symbol” error (see the toy sketch after the notes below).
- Note 1: This must be a common mistake for CMake newbies. Let's use it in next year's C++ course final exam!
- Note 2: Private linking is reasonable for nested dependencies to avoid binary bloat.
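To make this concrete, here is a toy sketch of the failure mode at the linker level (hypothetical files foo.c, bar.c, and main.c, not part of the PyTorch build):
# foo.c defines foo(), which calls bar(); bar.c defines bar(); main.c calls foo().
gcc -c foo.c bar.c main.c
ar rcs libfoo.a foo.o
ar rcs libbar.a bar.o

# Listing only libfoo.a fails: main.o plus foo.o still contains an
# unresolved reference to bar(), and nothing on the link line provides it.
gcc main.o -L. -lfoo          # undefined reference to `bar'

# Static linking needs every transitive dependency spelled out explicitly.
gcc main.o -L. -lfoo -lbar    # links successfully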
To fix this, I had to manually link every missing dependency, which was tedious. Since PyTorch's CMake files declare more private links than code actually needs, I did not add them wholesale; instead, I inspected each link error, identified the library that defines the missing symbol, and added a corresponding target_link_libraries(code PRIVATE <lib_name>) entry.
Locating the library that defines each missing symbol is the tedious part, so I wrote a script to automate it. (It may be helpful!)
#!/bin/bash
# Usage: ./find_internal_symbol.sh <symbol_name> [search_path]
# Example: ./find_internal_symbol.sh dlamch_ /usr/lib

SYMBOL=$1
SEARCH_PATH=${2:-/}

if [ -z "$SYMBOL" ]; then
    echo "Usage: $0 <symbol_name> [search_path]"
    exit 1
fi

echo "Searching for internally defined symbol '$SYMBOL' under '$SEARCH_PATH'..."

find "$SEARCH_PATH" \( -name "*.a" -o -name "*.so" \) -type f 2>/dev/null | while read -r lib; do
    if [[ "$lib" == *.a ]]; then
        # Static archive: read the regular symbol table and keep only
        # defined symbols ($2 is the type; "U" means undefined).
        SYMBOL_OUTPUT=$(nm "$lib" 2>/dev/null | awk '{if ($2 != "U" && $3 == "'"$SYMBOL"'") print}')
    else
        # Shared library: use the dynamic symbol table (-D), which is
        # still available even when the library has been stripped.
        SYMBOL_OUTPUT=$(nm -D "$lib" 2>/dev/null | awk '{if ($2 != "U" && $3 == "'"$SYMBOL"'") print}')
    fi
    if [ -n "$SYMBOL_OUTPUT" ]; then
        echo "Found in: $lib"
        echo "$SYMBOL_OUTPUT"
        echo
    fi
done
Circular Dependencies
Why were there still undefined symbols even though the missing symbols were now linked in? I tried reordering the libraries based on the observed dependencies, but it didn't help; eventually I realized that circular dependencies were the culprit. A common workaround is wrapping the libraries with -Wl,--start-group/-Wl,--end-group, but that alone didn't resolve it. I then used make VERBOSE=1 to inspect the actual linker command and found that the circular dependencies occur not only among the private dependencies I had just added, but also among the public dependencies already linked through libtorch.
For example, if I wrap torch mkl_core mkl_gnu_thread into a group, the circular dependency between the last two libraries is resolved, but the linker command appends further libraries, such as libmagma.a, after -Wl,--start-group libtorch.a libmkl_core.a libmkl_gnu_thread.a -Wl,--end-group, which reintroduces a circular dependency between magma and mkl. So it is necessary to wrap all the libraries into a single group, as the sketch below demonstrates.
- I’m not sure this is optimal; group rescanning is known to be costly, though the cost falls on link time rather than runtime.
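For reference, here is what grouping does, extending the earlier toy example (again hypothetical): suppose bar() now also calls a helper() that lives in a second object file inside libfoo.a, so the two archives depend on each other.
# libfoo.a = foo.o (foo calls bar) + helper.o (defines helper)
# libbar.a = bar.o (bar calls helper)
gcc -c foo.c helper.c bar.c main.c
ar rcs libfoo.a foo.o helper.o
ar rcs libbar.a bar.o

# Each archive is scanned once, left to right, so no single order works:
gcc main.o -L. -lfoo -lbar    # helper() undefined: libfoo.a was already passed
gcc main.o -L. -lbar -lfoo    # bar() undefined: bar.o was never pulled in

# A group is rescanned until no new undefined symbols can be resolved:
gcc main.o -L. -Wl,--start-group -lfoo -lbar -Wl,--end-group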
Finally, I added this line to the CMakeLists.txt:
set(CMAKE_CXX_LINK_EXECUTABLE "<CMAKE_CXX_COMPILER> -Wl,--start-group <OBJECTS> <LINK_LIBRARIES> -Wl,--end-group -o <TARGET>")
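Note that CMAKE_CXX_LINK_EXECUTABLE is a global rule variable, so this wraps the link line of every C++ executable in the build tree in one big group. It is a blunt instrument, but grouping only changes how the linker rescans archives during symbol resolution, so it should not change the behavior of the resulting binary.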
Whole Archive
The compilation passed! I happily ran ./bin/code, and the executable immediately crashed. An almost tear-inducing moment. The error message was:
terminate called after throwing an instance of 'c10::Error'
what(): PyTorch is not linked with support for cuda devices
Exception raised from getDeviceGuardImpl at /root/pytorch/c10/core/impl/DeviceGuardImplInterface.h:366 (most recent call first):
frame #0: <unknown function> + 0x1d0f4b9b (0x55755fec9b9b in ./bin/code)
frame #1: <unknown function> + 0x1d0f452c (0x55755fec952c in ./bin/code)
It appeared that the CUDA backend dispatcher wasn't linked in correctly. Since this didn't occur with dynamic linking, I suspected that the registration variables for the backend dispatch logic were dropped from the static binary. This happens because registration is triggered by the initialization of a static variable, and the linker only pulls an object file out of a static archive when some symbol in it is referenced; nothing in my main function references these registrar variables, so their object files, and hence their initializers, never make it into the binary. So I read through the dispatcher logic and found:
// c10/core/impl/DeviceGuardImplInterface.cpp
DeviceGuardImplRegistrar::DeviceGuardImplRegistrar(
DeviceType type,
const DeviceGuardImplInterface* impl) {
device_guard_impl_registry[static_cast<size_t>(type)].store(impl);
}
// c10/core/impl/DeviceGuardImplInterface.h
#define C10_REGISTER_GUARD_IMPL(DevType, DeviceGuardImpl) \
static ::c10::impl::DeviceGuardImplRegistrar C10_ANONYMOUS_VARIABLE( \
g_##DeviceType)(::c10::DeviceType::DevType, new DeviceGuardImpl());
// c10/cuda/impl/CUDAGuardImpl.cpp
namespace c10::cuda::impl {
C10_REGISTER_GUARD_IMPL(CUDA, CUDAGuardImpl);
} // namespace c10::cuda::impl
This confirmed my suspicion. A common fix is -Wl,--whole-archive, which forces the linker to keep every object file in an archive. However, unlike in my course labs, applying it to all libraries caused duplicate-symbol errors, so I needed to pinpoint the specific library containing the registration logic. Using my script above, I located it in libc10_cuda.a and added -Wl,--whole-archive -lc10_cuda -Wl,--no-whole-archive to the link line.
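Putting it together, the final link line has roughly this shape (an illustrative sketch, not the exact command CMake generates; $TORCH_LIB and $OTHER_LIBS are placeholders for the archive directory and the remaining dependencies):
# Keep ALL objects from libc10_cuda.a so its static registrars are retained;
# everything else stays in one rescanned group.
g++ main.o -L"$TORCH_LIB" \
    -Wl,--whole-archive -lc10_cuda -Wl,--no-whole-archive \
    -Wl,--start-group -ltorch -ltorch_cpu -ltorch_cuda $OTHER_LIBS -Wl,--end-group \
    -o code

# Mangled C++ names contain the class name, so a plain grep over nm output
# confirms the registrar objects made it into the binary.
nm ./code | grep DeviceGuardImplRegistrar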
Finally, I could run the code successfully… I really wanted to shout, but it was 3 a.m., so I just shouted at my mentor online lol.
