
Problem

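Training Faster R-CNN on Pascal VOC and COCO works fine.
But when training the PyTorch 1.0 version of Faster R-CNN on my own dataset (in Pascal VOC format), the following warning appears: Warning: NaN or Inf found in input tensor.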

Cause

The learning rate may be too large, in which case it should be reduced. The most direct check is to set the learning rate to 0 and see whether the NaN still appears.
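Alternatively, the annotations may be at fault: the coordinates in my own data are 0-based, but the source code subtracts 1 from each one. If an annotation contains a 0, the subtraction yields -1, which is out of range. Because pascal_voc.py stores the boxes in an np.uint16 array, that -1 wraps around to a huge value (typically 65535).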
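A small sketch of how a single 0 coordinate can turn into a NaN loss. It assumes uint16 box storage and the standard Faster R-CNN width target log(gt_w / anchor_w) with the +1 pixel-width convention; `load_box` and `width_target` are hypothetical stand-ins, not functions from the repository. The sketch casts from int64 so that the wraparound of -1 to 65535 is well-defined.

```python
import numpy as np


def load_box(xmin, ymin, xmax, ymax, subtract_one=True):
    """Hypothetical stand-in for the VOC loader: optional -1 shift, then uint16 storage."""
    coords = np.array([xmin, ymin, xmax, ymax], dtype=np.int64)
    if subtract_one:
        coords = coords - 1  # what the unmodified pascal_voc.py does
    # int64 -> uint16 wraps modulo 2**16, so -1 becomes 65535
    return coords.astype(np.uint16)


def width_target(gt_box, anchor_box):
    """Width term of the box regression target: log(gt_w / anchor_w)."""
    gt_w = float(gt_box[2]) - float(gt_box[0]) + 1.0
    anchor_w = float(anchor_box[2]) - float(anchor_box[0]) + 1.0
    with np.errstate(invalid="ignore"):  # silence the log-of-negative warning
        return float(np.log(gt_w / anchor_w))


anchor = np.array([0, 0, 15, 15])
bad = load_box(0, 10, 50, 60, subtract_one=True)
good = load_box(0, 10, 50, 60, subtract_one=False)
print(bad.tolist())               # [65535, 9, 49, 59]
print(width_target(bad, anchor))  # nan: the ground-truth width is negative
print(width_target(good, anchor))
```

With the wrapped xmin the computed width is negative, so the log returns NaN, and that NaN then propagates into the loss.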

Solution

Set lr=0. If loss=nan no longer appears, the learning rate was too large and caused exploding gradients. In that case, lower the learning rate and adjust the weight decay.
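Why the lr=0 test works: with a zero learning rate the weights never change, so a NaN that still shows up must come from the data or the forward pass, not from the optimizer. The toy sketch below illustrates the optimizer side on a one-parameter quadratic loss (`train_toy` is a hypothetical name, and the model is a stand-in, not Faster R-CNN). With lr=0 the loss stays constant, a small lr converges, and a too-large lr makes the loss blow up to inf.

```python
import math


def train_toy(lr, steps=2000, w0=1.0):
    """Plain gradient descent on the toy loss w**2 (a stand-in, not Faster R-CNN)."""
    w = w0
    losses = []
    for _ in range(steps):
        loss = w * w
        losses.append(loss)
        if not math.isfinite(loss):
            break  # the "NaN or Inf" situation
        w -= lr * (2.0 * w)  # gradient of w**2 is 2w
    return losses


for lr in (0.0, 0.1, 1.5):
    losses = train_toy(lr)
    print(lr, len(losses), losses[-1])
```

With lr=1.5 each update multiplies w by -2, so the loss grows by a factor of 4 per step until it overflows.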

If loss=nan persists with lr=0, modify the code in pascal_voc.py that reads the bounding boxes:

Before:
        bbox = obj.find('bndbox')
        # Make pixel indexes 0-based
        x1 = float(bbox.find('xmin').text) - 1
        y1 = float(bbox.find('ymin').text) - 1
        x2 = float(bbox.find('xmax').text) - 1
        y2 = float(bbox.find('ymax').text) - 1
After:
        bbox = obj.find('bndbox')
        # Annotations are already 0-based, so the -1 offset is commented out
        x1 = float(bbox.find('xmin').text) # - 1
        y1 = float(bbox.find('ymin').text) # - 1
        x2 = float(bbox.find('xmax').text) # - 1
        y2 = float(bbox.find('ymax').text) # - 1
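Instead of commenting out the -1 unconditionally, one option is to make the offset explicit and validate each box as it is loaded, so a bad annotation fails loudly instead of surfacing later as a NaN. This is a sketch using xml.etree.ElementTree; `read_voc_boxes` and its `one_based` flag are hypothetical, not part of pascal_voc.py, and it returns plain tuples rather than filling the loader's uint16 array.

```python
import xml.etree.ElementTree as ET


def read_voc_boxes(xml_text, one_based=False):
    """Parse <object><bndbox> entries; subtract 1 only if the annotations are 1-based."""
    offset = 1.0 if one_based else 0.0
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.findall("object"):
        bbox = obj.find("bndbox")
        x1 = float(bbox.find("xmin").text) - offset
        y1 = float(bbox.find("ymin").text) - offset
        x2 = float(bbox.find("xmax").text) - offset
        y2 = float(bbox.find("ymax").text) - offset
        # Reject boxes that would go negative or be inverted after the offset
        if x1 < 0 or y1 < 0 or x2 < x1 or y2 < y1:
            raise ValueError(f"invalid box {(x1, y1, x2, y2)} for {obj.findtext('name')}")
        boxes.append((x1, y1, x2, y2))
    return boxes


SAMPLE = """<annotation>
  <object><name>cat</name>
    <bndbox><xmin>0</xmin><ymin>4</ymin><xmax>20</xmax><ymax>30</ymax></bndbox>
  </object>
</annotation>"""

print(read_voc_boxes(SAMPLE))  # [(0.0, 4.0, 20.0, 30.0)]
```

Calling `read_voc_boxes(SAMPLE, one_based=True)` raises a ValueError, because xmin=0 would become -1. This is the same case that the original -1 silently turns into a wrapped uint16 value.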

If horizontal flipping is enabled (cfg.TRAIN.USE_FLIPPED = True), the same -1 must also be removed in the append_flipped_images(self) method of imdb.py:

Before:
    boxes[:, 0] = widths[i] - oldx2  - 1
    boxes[:, 2] = widths[i] - oldx1  - 1
After:
    boxes[:, 0] = widths[i] - oldx2  # - 1
    boxes[:, 2] = widths[i] - oldx1  # - 1
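The flip arithmetic can be checked in isolation. The sketch below mirrors the two lines above, with `offset=1` for the original code and `offset=0` for the modified one. It assumes the flipped boxes end up in uint16 storage and are then checked with x2 >= x1, as append_flipped_images does. `flip_boxes` and `passes_flip_assert` are hypothetical helpers. The example box has xmax equal to the image width, which is one case where the -1 produces a negative coordinate.

```python
import numpy as np


def flip_boxes(boxes, width, offset):
    """Mirror [x1, y1, x2, y2] boxes horizontally, like append_flipped_images.

    offset=1 matches the original imdb.py lines, offset=0 the modified ones.
    """
    flipped = boxes.astype(np.int64).copy()  # signed, so negatives stay visible
    old_x1 = boxes[:, 0].astype(np.int64)
    old_x2 = boxes[:, 2].astype(np.int64)
    flipped[:, 0] = width - old_x2 - offset
    flipped[:, 2] = width - old_x1 - offset
    return flipped


def passes_flip_assert(flipped):
    """Store as uint16 (as the loader does) and apply the x2 >= x1 check."""
    stored = flipped.astype(np.uint16)  # -1 wraps to 65535 here
    return bool((stored[:, 2] >= stored[:, 0]).all())


width = 100
boxes = np.array([[10, 5, 100, 50]], dtype=np.uint16)  # xmax equals the image width
print(flip_boxes(boxes, width, offset=1).tolist())  # [[-1, 5, 89, 50]]
print(passes_flip_assert(flip_boxes(boxes, width, offset=1)))  # False
print(passes_flip_assert(flip_boxes(boxes, width, offset=0)))  # True
```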