PyTorch Inplace Operation Notes

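While running experiments recently I hit the following bug and spent a long time fixing it. I am writing it down here to learn from it and to recognize it if it comes up again.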

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [3, 64, 7, 7]] is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

About Inplace Operations

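First, the error message itself says the problem is caused by an inplace operation. After some searching: an inplace operation is one where PyTorch, when computing a value, does not create a new variable to hold the result but directly overwrites the value of the original variable. In the code below, the first operation is not inplace, while the last two are.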

import torch

x = torch.rand(2)
y = torch.rand(2)
# non-inplace operation: the result is stored in a new tensor z
z = x + y
# inplace operations: x itself is overwritten
x.add_(y)
x += y
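The "is at version 2; expected version 1" part of the error refers to a per-tensor version counter that is bumped by every inplace operation. Autograd records the version when it saves a tensor for the backward pass and compares it again when backward runs. The sketch below is only an illustration of that counter: it reads the semi-private attribute `_version`, and the variable names are made up for this example.

```python
import torch

x = torch.rand(2)
y = torch.rand(2)

ptr_before = x.data_ptr()        # address of the memory backing x
version_start = x._version       # version counter before any modification

z = x + y                        # out-of-place: x is not touched
version_after_plus = x._version

x.add_(y)                        # inplace: x is overwritten
version_after_add_ = x._version

x += y                           # also inplace
version_after_iadd = x._version

same_memory = x.data_ptr() == ptr_before  # x still lives in the same memory

print(version_start, version_after_plus, version_after_add_, version_after_iadd)
print("same memory:", same_memory)
```

The counter stays the same after `z = x + y` and grows after each of the two inplace calls, while `x` keeps using the same memory throughout.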

Debugging Process

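Everything I first found online said to remove as many inplace operations as possible, including dropping inplace=True from activation layers and so on, but none of that fixed the problem. So I followed the hint in the error message and used torch.autograd.set_detect_anomaly(True) to locate where the inplace operation happened (see Debugging feature for "modified by an inplace operation" errors · Issue #15803 · pytorch/pytorch · GitHub). It finally reported the following: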

[W python_anomaly_mode.cpp:104] Warning: Error detected in CudnnConvolutionBackward. Traceback of forward call that caused the error:
  File "./train.py", line 44, in <module>
    model.optimize_parameters()
  File "xxxx.py", line 374, in optimize_parameters
    self.backward_all_net()
  File "xxxx.py", line 273, in backward_all_net
    _, _, _, fake_B_x6_processed = self.pixelnet(fake_B_x6)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "xxxx.py", line 282, in forward
    output = self.model_3(feature_256)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 443, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 440, in _conv_forward
    self.padding, self.dilation, self.groups)
 (function _print_stack)
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [3, 64, 7, 7]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
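The anomaly-detection step can be tried on a tiny CPU example. The sketch below is not the code from my project: the function name and tensors are made up, and it uses the context-manager form torch.autograd.detect_anomaly() instead of the global set_detect_anomaly(True) switch. With anomaly detection on, PyTorch additionally prints a warning with the traceback of the forward call whose backward failed, in the same style as the log above.

```python
import torch

def run_with_anomaly_detection():
    """Trigger the inplace error on purpose and return its message."""
    a = torch.ones(3, requires_grad=True)
    with torch.autograd.detect_anomaly():
        b = a.exp()   # exp() saves its output b for the backward pass
        b.add_(1)     # inplace edit of the tensor that was saved
        try:
            b.sum().backward()
        except RuntimeError as err:
            return str(err)
    return None

message = run_with_anomaly_detection()
print(message)
```

Here the forward traceback points at the `a.exp()` line, which is the operation that could not compute its gradient, not at the `add_` call that did the modification. That matches the hint in the error: the variable was changed "in there or anywhere later".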

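From the traceback it looked as if the inplace operation had happened in a conv layer, but I really had no idea why. Eventually I found a blog post describing a situation very similar to mine (RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation - Js2Hou - cnblogs (cnblogs.com)), and it turned out to be a problem with the order in which the code was executed. The original code: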

def optimize_parameters(self):

    self.backward_all_net()

    self.optimizer_gridnet.zero_grad()
    self.loss_gridnet.backward(retain_graph=True)
    self.optimizer_gridnet.step()

    self.optimizer_pixelnet.zero_grad()
    self.loss_pixelnet.backward(retain_graph=True)
    self.optimizer_pixelnet.step()

    self.optimizer_depixelnet.zero_grad()
    self.loss_depixelnet.backward()
    self.optimizer_depixelnet.step()  # the error occurs here

    self.optimizer_D_gridnet.zero_grad()
    self.backward_D_gridnet()
    self.optimizer_D_gridnet.step()

    self.optimizer_D_pixelnet.zero_grad()
    self.backward_D_pixelnet()
    self.optimizer_D_pixelnet.step()

    self.optimizer_D_depixelnet.zero_grad()
    self.backward_D_depixelnet()
    self.optimizer_D_depixelnet.step()

The revised code:

def optimize_parameters(self):
    self.optimizer_gridnet.zero_grad()
    self.optimizer_pixelnet.zero_grad()
    self.optimizer_depixelnet.zero_grad()
    self.optimizer_D_gridnet.zero_grad()
    self.optimizer_D_pixelnet.zero_grad()
    self.optimizer_D_depixelnet.zero_grad()

    self.backward_all_net()

    self.loss_gridnet.backward(retain_graph=True)
    self.loss_pixelnet.backward(retain_graph=True)
    self.loss_depixelnet.backward()

    self.optimizer_gridnet.step()
    self.optimizer_pixelnet.step()
    self.optimizer_depixelnet.step()

    self.backward_D_gridnet()
    self.backward_D_pixelnet()
    self.backward_D_depixelnet()

    self.optimizer_D_gridnet.step()
    self.optimizer_D_pixelnet.step()
    self.optimizer_D_depixelnet.step()

Finally solved!
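A likely reason the reordering helps: optimizer.step() updates the parameters in place, so calling it between two backward() passes over the same retained graph changes weights that the second backward still needs. The tensor named in the error, of shape [3, 64, 7, 7], looks like a conv weight, which fits this reading, though I have not verified it against the internals of my networks. The sketch below reproduces the pattern with two small hypothetical Linear layers (net_a, net_b and the two losses are made up for the example and are not the networks from my project).

```python
import torch

def make_nets():
    """Two chained toy networks, each with its own optimizer."""
    torch.manual_seed(0)
    net_a = torch.nn.Linear(4, 4)
    net_b = torch.nn.Linear(4, 1)
    opt_a = torch.optim.SGD(net_a.parameters(), lr=0.1)
    opt_b = torch.optim.SGD(net_b.parameters(), lr=0.1)
    return net_a, net_b, opt_a, opt_b

def train_step(interleave):
    """Run one update; return the error message, or None on success."""
    net_a, net_b, opt_a, opt_b = make_nets()
    out = net_b(net_a(torch.rand(2, 4)))   # one shared forward graph
    loss_1 = out.mean()
    loss_2 = out.pow(2).mean()
    opt_a.zero_grad()
    opt_b.zero_grad()
    try:
        loss_1.backward(retain_graph=True)
        if interleave:
            opt_b.step()                   # inplace update of net_b.weight
        loss_2.backward()                  # needs the weight saved in forward
        if not interleave:
            opt_b.step()                   # safe: every backward is finished
        opt_a.step()
    except RuntimeError as err:
        return str(err)
    return None

interleaved_error = train_step(interleave=True)
grouped_error = train_step(interleave=False)
print("interleaved:", interleaved_error)
print("grouped:", grouped_error)
```

With the interleaved order the second backward raises the same "modified by an inplace operation" error, while running all backward passes before any step completes without error. This is the same regrouping as in the revised optimize_parameters above.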

References

  1. A few things you need to know about PyTorch inplace operations - Zhihu (zhihu.com)

  2. The meaning of in-place operation in PyTorch - York1996's blog, CSDN

  3. Debugging feature for "modified by an inplace operation" errors · Issue #15803 · pytorch/pytorch · GitHub

  4. RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation - Js2Hou - cnblogs (cnblogs.com)