性能评估#
参考:profiler & tensorboard_profiler
这个菜谱解释了如何使用 PyTorch profiler 并度量模型运算的时间和内存消耗。
import torch
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity
定义简单的模型:
model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)
时间分析#
PyTorch profiler 是通过上下文管理器启用的,并接受一些参数,其中一些最有用的是:
activities
(profile 的活动列表):ProfilerActivity.CPU
:PyTorch 算子,TorchScript 函数和用户定义的代码标签(参见下面的record_function
);ProfilerActivity.CUDA
:CUDA 内核。
record_shapes
:是否记录算子输入的形状;profile_memory
:是否报告模型的张量消耗内存量;use_cuda
:是否度量 CUDA 内核的执行时间。
备注
当使用 CUDA 时,分析器还会显示主机上发生的 CUDA 运行时事件。
看看如何使用 profiler 来分析执行时间:
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
with record_function("model_inference"):
model(inputs)
备注
可以使用 record_function
上下文管理器用用户提供的名称标记任意代码范围(model_inference
在上面的例子中用作 label)。
Profiler 允许检查在使用 Profiler 上下文管理器包装的代码范围执行期间调用了哪些算子。如果多个分析器范围同时处于活动状态(例如在并行的 PyTorch 线程中),每个 profiler 上下文管理器只跟踪其对应范围的算子。Profiler 还会自动剖析使用 torch.jit._fork
启动的异步任务和(在向后传递的情况下)使用 backward()
调用启动的向后传递算子。
打印出上面执行的统计数据:
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
model_inference 8.04% 5.637ms 99.93% 70.030ms 70.030ms 1
aten::conv2d 0.17% 119.000us 72.91% 51.093ms 2.555ms 20
aten::convolution 0.43% 301.000us 72.74% 50.974ms 2.549ms 20
aten::_convolution 0.28% 194.000us 72.31% 50.673ms 2.534ms 20
aten::mkldnn_convolution 71.70% 50.246ms 72.03% 50.479ms 2.524ms 20
aten::batch_norm 0.10% 67.000us 11.56% 8.100ms 405.000us 20
aten::_batch_norm_impl_index 0.22% 152.000us 11.46% 8.033ms 401.650us 20
aten::native_batch_norm 10.70% 7.500ms 11.20% 7.846ms 392.300us 20
aten::max_pool2d 0.04% 27.000us 3.68% 2.579ms 2.579ms 1
aten::max_pool2d_with_indices 3.64% 2.552ms 3.64% 2.552ms 2.552ms 1
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 70.079ms
在这里可以看到,正如预期的那样,大部分时间都花在卷积上(特别是在 mkldnn_convolution
上,因为 PyTorch 编译时支持 MKL-DNN)。注意自定义 cpu 时间和 cpu 时间之间的区别——算子可以调用其他算子,自定义 cpu 时间不包括子算子调用的时间,而总 cpu 时间包括它。通过将 sort_by="self_cpu_time_total"
传递到 table
调用中,可以选择根据自 cpu 时间进行排序。
为了获得更细粒度的结果并包含算子输入形状,传递 group_by_input_shape=True
(注意:这需要运行带有 record_shapes=True
的 profiler):
print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total", row_limit=10))
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls Input Shapes
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
model_inference 8.04% 5.637ms 99.93% 70.030ms 70.030ms 1 []
aten::conv2d 0.03% 22.000us 15.04% 10.543ms 2.636ms 4 [[5, 64, 56, 56], [64, 64, 3, 3], [], [], [], [], []]
aten::convolution 0.08% 58.000us 15.01% 10.521ms 2.630ms 4 [[5, 64, 56, 56], [64, 64, 3, 3], [], [], [], [], [], [], []]
aten::_convolution 0.06% 41.000us 14.93% 10.463ms 2.616ms 4 [[5, 64, 56, 56], [64, 64, 3, 3], [], [], [], [], [], [], [], [], [], [], []]
aten::mkldnn_convolution 14.77% 10.348ms 14.87% 10.422ms 2.606ms 4 [[5, 64, 56, 56], [64, 64, 3, 3], [], [], [], [], []]
aten::conv2d 0.03% 24.000us 13.72% 9.613ms 9.613ms 1 [[5, 3, 224, 224], [64, 3, 7, 7], [], [], [], [], []]
aten::convolution 0.07% 46.000us 13.68% 9.589ms 9.589ms 1 [[5, 3, 224, 224], [64, 3, 7, 7], [], [], [], [], [], [], []]
aten::_convolution 0.03% 22.000us 13.62% 9.543ms 9.543ms 1 [[5, 3, 224, 224], [64, 3, 7, 7], [], [], [], [], [], [], [], [], [], [], []]
aten::mkldnn_convolution 13.55% 9.499ms 13.59% 9.521ms 9.521ms 1 [[5, 3, 224, 224], [64, 3, 7, 7], [], [], [], [], []]
aten::conv2d 0.02% 15.000us 12.70% 8.900ms 2.967ms 3 [[5, 512, 7, 7], [512, 512, 3, 3], [], [], [], [], []]
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
Self CPU time total: 70.079ms
注意在不同的输入形状下出现两次 aten::convolution
。
Profiler 还可以用来分析在 GPU 上执行的模型的性能:
model = models.resnet18().cuda()
inputs = torch.randn(5, 3, 224, 224).cuda()
with profile(activities=[
ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
with record_function("model_inference"):
model(inputs)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
model_inference 0.14% 2.617ms 100.00% 1.876s 1.876s 0.000us 0.00% 3.562ms 3.562ms 1
aten::conv2d 0.01% 94.000us 94.85% 1.780s 88.977ms 0.000us 0.00% 2.897ms 144.850us 20
aten::convolution 0.01% 229.000us 94.85% 1.779s 88.972ms 0.000us 0.00% 2.897ms 144.850us 20
aten::_convolution 0.01% 143.000us 94.83% 1.779s 88.961ms 0.000us 0.00% 2.897ms 144.850us 20
aten::cudnn_convolution 4.47% 83.840ms 94.83% 1.779s 88.954ms 2.897ms 81.33% 2.897ms 144.850us 20
volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148... 0.00% 0.000us 0.00% 0.000us 0.000us 1.288ms 36.16% 1.288ms 128.800us 10
volta_sgemm_64x64_nn 0.00% 0.000us 0.00% 0.000us 0.000us 536.000us 15.05% 536.000us 178.667us 3
aten::batch_norm 0.00% 46.000us 4.90% 91.996ms 4.600ms 0.000us 0.00% 353.000us 17.650us 20
aten::_batch_norm_impl_index 0.01% 98.000us 4.90% 91.950ms 4.598ms 0.000us 0.00% 353.000us 17.650us 20
aten::cudnn_batch_norm 0.11% 2.017ms 4.90% 91.852ms 4.593ms 353.000us 9.91% 353.000us 17.650us 20
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 1.876s
Self CUDA time total: 3.562ms
备注
第一次使用 CUDA 分析可能会带来额外的开销。
输出中出现的设备上内核(例如
sgemm_64x64_nn
)。
分析内存消耗#
PyTorch 分析器还可以显示在执行模型算子期间分配(或释放)的内存量(由模型张量使用)。在下面的输出中,'self' 内存对应于算子分配(释放)的内存,不包括对其他算子的子调用。通过 profile_memory=True
来启用内存分析功能。
model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)
with profile(activities=[ProfilerActivity.CPU],
profile_memory=True, record_shapes=True) as prof:
model(inputs)
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
print(prof.key_averages().table(sort_by="cpu_memory_usage", row_limit=10))
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg CPU Mem Self CPU Mem # of Calls
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::empty 0.83% 522.000us 0.83% 522.000us 2.610us 94.86 Mb 94.86 Mb 200
aten::max_pool2d_with_indices 5.94% 3.745ms 5.94% 3.745ms 3.745ms 11.48 Mb 11.48 Mb 1
aten::addmm 0.42% 264.000us 0.45% 286.000us 286.000us 19.53 Kb 19.53 Kb 1
aten::mean 0.06% 39.000us 0.24% 154.000us 154.000us 10.00 Kb 10.00 Kb 1
aten::empty_strided 0.01% 5.000us 0.01% 5.000us 5.000us 4 b 4 b 1
aten::conv2d 0.15% 93.000us 77.98% 49.160ms 2.458ms 47.37 Mb 0 b 20
aten::convolution 0.44% 278.000us 77.83% 49.067ms 2.453ms 47.37 Mb 0 b 20
aten::_convolution 0.31% 197.000us 77.39% 48.789ms 2.439ms 47.37 Mb 0 b 20
aten::mkldnn_convolution 76.72% 48.370ms 77.08% 48.592ms 2.430ms 47.37 Mb 0 b 20
aten::as_strided_ 0.05% 32.000us 0.05% 32.000us 1.600us 0 b 0 b 20
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 63.045ms
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg CPU Mem Self CPU Mem # of Calls
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::empty 0.83% 522.000us 0.83% 522.000us 2.610us 94.86 Mb 94.86 Mb 200
aten::batch_norm 0.10% 62.000us 12.56% 7.920ms 396.000us 47.41 Mb 0 b 20
aten::_batch_norm_impl_index 0.19% 119.000us 12.46% 7.858ms 392.900us 47.41 Mb 0 b 20
aten::native_batch_norm 11.65% 7.342ms 12.22% 7.705ms 385.250us 47.41 Mb -75.00 Kb 20
aten::conv2d 0.15% 93.000us 77.98% 49.160ms 2.458ms 47.37 Mb 0 b 20
aten::convolution 0.44% 278.000us 77.83% 49.067ms 2.453ms 47.37 Mb 0 b 20
aten::_convolution 0.31% 197.000us 77.39% 48.789ms 2.439ms 47.37 Mb 0 b 20
aten::mkldnn_convolution 76.72% 48.370ms 77.08% 48.592ms 2.430ms 47.37 Mb 0 b 20
aten::empty_like 0.10% 65.000us 0.23% 144.000us 7.200us 47.37 Mb 0 b 20
aten::max_pool2d 0.01% 5.000us 5.95% 3.750ms 3.750ms 11.48 Mb 0 b 1
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 63.045ms
使用追踪功能#
分析结果可以输出为 .json
追踪文件:
model = models.resnet18().cuda()
inputs = torch.randn(5, 3, 224, 224).cuda()
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
model(inputs)
prof.export_chrome_trace("trace.json")
你可以在 Chrome 跟踪查看器(Chrome://tracing
)中检查分析算子和 CUDA 内核的序列。
检查堆栈跟踪#
Profiler 可以用来分析 Python 和 TorchScript 堆栈跟踪:
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
with_stack=True,
) as prof:
model(inputs)
# Print aggregated stats
print(prof.key_averages(group_by_stack_n=5).table(sort_by="self_cuda_time_total", row_limit=2))
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -----------------------------------------------------------------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls Source Location
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -----------------------------------------------------------------------
aten::cudnn_convolution 16.43% 2.415ms 74.21% 10.909ms 545.450us 2.942ms 81.41% 2.942ms 147.100us 20 runpy.py(75): _run_code
site-packages/ipykernel_launcher.py(12): <module>
site-packages/traitlets/config/application.py(974): launch_instance
site-packages/ipykernel/kernelapp.py(702): start
site-packages/tornado/platform/asyncio.py(195): start
volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148... 0.00% 0.000us 0.00% 0.000us 0.000us 1.323ms 36.61% 1.323ms 132.300us 10
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -----------------------------------------------------------------------
Self CPU time total: 14.701ms
Self CUDA time total: 3.614ms
注意 torchvision/models/resnet.py
脚本中的两个卷积和两个调用站点。
将数据可视化为 flamegraph#
执行时间(self_cpu_time_total
和 self_cuda_time_total
指标)和堆栈跟踪也可以可视化为 flamegraph。为此,首先使用 export_stacks
导出原始数据(需要 with_stack=True
):
prof.export_stacks("/tmp/profiler_stacks.txt", "self_cuda_time_total")
小技巧
建议使用例如 firegraph 工具来生成交互式 SVG:
# git clone https://github.com/brendangregg/FlameGraph
# cd FlameGraph
# ./flamegraph.pl --title "CUDA time" --countname "us." /tmp/profiler_stacks.txt > perf_viz.svg
分析 long-running jobs#
PyTorch 分析器提供了额外的 API 来处理长时间运行的作业(比如训练循环)。跟踪所有的执行可能会很慢,并导致非常大的跟踪文件。为了避免这种情况,使用可选参数:
schedule
:指定一个函数,它接受一个整数参数(步数)作为输入并为分析器返回一个 action,使用该参数的最佳方式是使用torch.profiler.schedule
辅助函数,它可以为您生成一个调度;on_trace_ready
:指定一个函数,该函数接受对分析程序的引用作为输入,并在每次新跟踪就绪时由分析程序调用。
from torch.profiler import schedule
my_schedule = schedule(
skip_first=10,
wait=5,
warmup=1,
active=3,
repeat=2)
def trace_handler(p):
output = p.key_averages().table(sort_by="self_cuda_time_total", row_limit=10)
print(output)
p.export_chrome_trace("/tmp/trace_" + str(p.step_num) + ".json")
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
schedule=torch.profiler.schedule(
wait=1,
warmup=1,
active=2),
on_trace_ready=trace_handler
) as p:
for idx in range(8):
model(inputs)
p.step()
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::cudnn_convolution 23.10% 4.721ms 29.95% 6.122ms 153.050us 5.930ms 80.85% 5.930ms 148.250us 40
volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148... 0.00% 0.000us 0.00% 0.000us 0.000us 2.639ms 35.98% 2.639ms 131.950us 20
volta_sgemm_64x64_nn 0.00% 0.000us 0.00% 0.000us 0.000us 1.094ms 14.91% 1.094ms 182.333us 6
aten::cudnn_batch_norm 10.14% 2.073ms 19.91% 4.070ms 101.750us 746.000us 10.17% 746.000us 18.650us 40
void cudnn::winograd_nonfused::winogradForwardFilter... 0.00% 0.000us 0.00% 0.000us 0.000us 686.000us 9.35% 686.000us 114.333us 6
volta_scudnn_128x32_sliced1x4_ldg4_relu_exp_medium_n... 0.00% 0.000us 0.00% 0.000us 0.000us 429.000us 5.85% 429.000us 107.250us 4
void cudnn::bn_fw_tr_1C11_singleread<float, 512, tru... 0.00% 0.000us 0.00% 0.000us 0.000us 348.000us 4.74% 348.000us 11.600us 30
volta_scudnn_128x64_relu_medium_nn_v1 0.00% 0.000us 0.00% 0.000us 0.000us 344.000us 4.69% 344.000us 86.000us 4
aten::clamp_min_ 2.36% 482.000us 4.47% 914.000us 26.882us 281.000us 3.83% 281.000us 8.265us 34
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 281.000us 3.83% 281.000us 8.265us 34
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 20.440ms
Self CUDA time total: 7.335ms
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::cudnn_convolution 23.45% 3.215ms 30.18% 4.137ms 103.425us 5.790ms 80.60% 5.790ms 144.750us 40
volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148... 0.00% 0.000us 0.00% 0.000us 0.000us 2.578ms 35.89% 2.578ms 128.900us 20
volta_sgemm_64x64_nn 0.00% 0.000us 0.00% 0.000us 0.000us 1.061ms 14.77% 1.061ms 176.833us 6
aten::cudnn_batch_norm 10.26% 1.406ms 20.45% 2.803ms 70.075us 739.000us 10.29% 739.000us 18.475us 40
void cudnn::winograd_nonfused::winogradForwardFilter... 0.00% 0.000us 0.00% 0.000us 0.000us 676.000us 9.41% 676.000us 112.667us 6
volta_scudnn_128x32_sliced1x4_ldg4_relu_exp_medium_n... 0.00% 0.000us 0.00% 0.000us 0.000us 416.000us 5.79% 416.000us 104.000us 4
void cudnn::bn_fw_tr_1C11_singleread<float, 512, tru... 0.00% 0.000us 0.00% 0.000us 0.000us 339.000us 4.72% 339.000us 11.300us 30
volta_scudnn_128x64_relu_medium_nn_v1 0.00% 0.000us 0.00% 0.000us 0.000us 333.000us 4.64% 333.000us 83.250us 4
aten::clamp_min_ 2.27% 311.000us 4.44% 608.000us 17.882us 275.000us 3.83% 275.000us 8.088us 34
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 275.000us 3.83% 275.000us 8.088us 34
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 13.709ms
Self CUDA time total: 7.184ms