CU_EVENT_DEFAULT with CU_EVENT_DISABLE_TIMING for better performance of CUDA events
--serve (on port 8080 by default) and attach a client to it with --attach http://127.0.0.1:8080
--archive flag. Generate an offline profile and view it either via --attach or by uploading it to a server and navigating to https://legion.stanford.edu/prof-viewer/?url=...
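
The upload-and-view option can be sketched as a small helper that builds the viewer link. This is only an illustration: the note above gives the viewer address and its url query parameter, while the helper name viewer_link, the example.org upload location, and the choice to percent-encode the parameter are assumptions rather than documented behavior.

```python
from urllib.parse import quote

# Viewer address taken from the note above.
VIEWER_BASE = "https://legion.stanford.edu/prof-viewer/"


def viewer_link(profile_url: str) -> str:
    """Build a viewer link whose `url` parameter points at an uploaded profile."""
    # Assumption: percent-encode the profile location so that its own
    # ':' and '/' characters are not misread as part of the viewer URL.
    encoded = quote(profile_url, safe="")
    return f"{VIEWER_BASE}?url={encoded}"


if __name__ == "__main__":
    # Hypothetical upload location, used purely for illustration.
    print(viewer_link("https://example.org/profiles/run1"))
```

If the viewer expects the raw, unencoded location instead, drop the quote call and concatenate the address directly.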