Faster R-CNN: Paper and Source Code Walkthrough

Paper:
link

Source code:
link

1. Overall Framework of the Model

[Figure: overall Faster R-CNN pipeline]
As shown in the figure above, the algorithm can be divided into four stages:
1. Conv layers: extract the feature maps. Faster R-CNN first uses a stack of basic conv + relu + pooling layers to extract feature maps from the input image; these feature maps are shared by the subsequent RPN and RoI Pooling stages. With a VGG16 backbone, for example, this stack consists of 13 conv, 13 relu, and 4 pooling layers.

2. Region Proposal Network: the RPN generates region proposals. It first produces a dense set of anchors and then filters them: a softmax branch classifies each anchor as foreground (an object) or background (not an object), i.e. a binary classification, while a parallel bounding-box regression branch refines the anchor boxes into more accurate proposals.

3. RoI Pooling: using the proposals produced by the RPN and the feature map from the last backbone layer, this stage produces fixed-size proposal feature maps, which are passed to the following fully connected layers for recognition and localization.

4. Classifier: the fixed-size feature maps produced by RoI Pooling go through fully connected layers; a softmax performs the per-class classification, while a smooth L1 loss drives the bounding-box regression that refines the final object locations.

2. Network Structure

[Figure: detailed network structure]

2.1 Conv Layers

[Input image size in Faster R-CNN]
Faster R-CNN typically rescales the input image so that the shorter side is 600 and the longer side is at most 1000.
Assume the rescaled input image has size H × W.

[Backbone structure]
Taking VGG16 as an example:
13 conv layers: kernel_size = 3, padding = 1, stride = 1 (convolutions keep the spatial size unchanged);
+
13 relu layers: activation functions, which do not change the spatial size;
+
4 pooling layers: kernel_size = 2, stride = 2; each pooling layer halves the spatial size.

After the conv layers the spatial size becomes (H/16) × (W/16), denoted M × N, so the output feature map has shape M × N × 512 (note: 512-d for VGG, 256-d for ZF), i.e. a spatial size of M × N with 512 channels.
[Figure: illustration of the convolution process]
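As a quick sanity check of these numbers, here is a minimal sketch (not from the repository; the helper name compute_rescale is illustrative) of the short-side-600 / long-side-1000 rescaling rule and the resulting /16 feature-map size:

```python
import numpy as np

def compute_rescale(h, w, target_size=600, max_size=1000):
    """Scale so the short side becomes 600, but cap the long side at 1000."""
    im_size_min, im_size_max = min(h, w), max(h, w)
    im_scale = float(target_size) / im_size_min
    if round(im_scale * im_size_max) > max_size:      # long side would exceed 1000
        im_scale = float(max_size) / im_size_max
    return im_scale

h, w = 375, 500                                        # e.g. a PASCAL VOC image
scale = compute_rescale(h, w)                          # 600 / 375 = 1.6
H, W = int(round(h * scale)), int(round(w * scale))    # 600 x 800
feat_h, feat_w = H // 16, W // 16                      # VGG16: 4 poolings -> /16 -> 37 x 50
print(H, W, feat_h, feat_w)
```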

2.2 RPN (Region Proposal Networks)

The RPN splits into two branches:
rpn_cls and rpn_bbox.
After the feature map enters the RPN, it first passes through a 3×3 convolution; the feature map stays M × N × 512. This further aggregates local feature information. The result then goes through two parallel 1×1 convolutions (kernel_size = 1, padding = 0, stride = 1), one producing 18 channels and the other 36 channels (a small sketch follows below).
[Figure: two-branch structure of the RPN]
1) rpn_cls:
M×N×512 * 1×1×512×18 -> M×N×18
2) rpn_bbox:
M×N×512 * 1×1×512×36 -> M×N×36
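A minimal PyTorch sketch of this two-branch head (the class name RPNHead is illustrative; channel sizes assume the 512-d VGG16 feature map and 9 anchors per location):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPNHead(nn.Module):
    """Sketch of the RPN head described above."""
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.rpn_conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)    # 3x3, keeps M x N
        self.rpn_cls = nn.Conv2d(512, num_anchors * 2, kernel_size=1)            # 18-d: fg/bg per anchor
        self.rpn_bbox = nn.Conv2d(512, num_anchors * 4, kernel_size=1)           # 36-d: dx,dy,dw,dh per anchor

    def forward(self, feat):                      # feat: [1, 512, M, N]
        x = F.relu(self.rpn_conv(feat))           # [1, 512, M, N]
        return self.rpn_cls(x), self.rpn_bbox(x)  # [1, 18, M, N], [1, 36, M, N]

scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
print(scores.shape, deltas.shape)   # [1, 18, 38, 50], [1, 36, 38, 50]
```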

2.2.1 Generating Anchor Boxes

After the conv layers the image is downsampled to 1/16 of its original size, so feat_stride = 16. To generate anchors, a base anchor of size 16×16 is defined first (each point on the feature map has a receptive field covering a region of the original image; 16 is used because one feature-map point corresponds to a 16×16 region of the original image). The source code represents it as the array [0, 0, 15, 15]. Three aspect ratios [1:2, 1:1, 2:1] are then combined with three scales, so each point expands into 9 boxes.
(See generate_anchors.py in the source code.)
1. Set the size of the base anchor:

base_anchor = np.array([1, 1, base_size, base_size]) - 1
# base_size=16
# base_anchor = [0, 0, 15, 15]

2. Anchors produced from base_anchor = [0, 0, 15, 15] when the area is kept constant and the aspect ratios are [0.5, 1, 2]:

def _ratio_enum(anchor, ratios):
    """
    Enumerate a set of anchors for each aspect ratio wrt an anchor.
    """
    w, h, x_ctr, y_ctr = _whctrs(anchor)
    size = w * h                 # size = 16 * 16 = 256
    size_ratios = size / ratios  # size_ratios: [512, 256, 128]
    # np.round(x) rounds to the nearest integer, np.sqrt(x) is the square root
    ws = np.round(np.sqrt(size_ratios))  # ws = [23, 16, 11]
    hs = np.round(ws * ratios)           # hs = [12, 16, 22]
    # convert back to the four-coordinate (x1, y1, x2, y2) anchor form
    anchors = _mkanchors(ws, hs, x_ctr, y_ctr)
    return anchors

def _whctrs(anchor):
    """
    Return width, height, x center, and y center for an anchor (window).
    """

    w = anchor[2] - anchor[0] + 1       # xmax - xmin + 1
    h = anchor[3] - anchor[1] + 1       # ymax - ymin + 1
    x_ctr = anchor[0] + 0.5 * (w - 1)   # x_center
    y_ctr = anchor[1] + 0.5 * (h - 1)   # y_center
    return w, h, x_ctr, y_ctr

def _mkanchors(ws, hs, x_ctr, y_ctr):
    """
    Given a vector of widths (ws) and heights (hs) around a center
    (x_ctr, y_ctr), output a set of anchors (windows).
   """

    ws = ws[:, np.newaxis]
    hs = hs[:, np.newaxis]
    anchors = np.hstack((x_ctr - 0.5 * (ws - 1), y_ctr - 0.5 * (hs - 1),
                         x_ctr + 0.5 * (ws - 1), y_ctr + 0.5 * (hs - 1)))
    return anchors

[Figure: the three anchors with different aspect ratios]
The generated anchors are:

array([[-3.5,  2. , 18.5, 13. ],
       [ 0. ,  0. , 15. , 15. ],
       [ 2.5, -3. , 12.5, 18. ]])

3. After the aspect-ratio transformation above, the scale transformation is applied next:

# scales = 2 ** np.arange(3, 6) = [8, 16, 32]
anchors = np.vstack([
        _scale_enum(ratio_anchors[i, :], scales) for i in range(ratio_anchors.shape[0])
        ])

The _scale_enum() function is defined below. For each of the three aspect-ratio anchors in ratio_anchors obtained in the previous step, three scale transformations are applied, so three aspect ratios combined with three scales yield 9 anchors. These are the 9 anchors associated with each point in the paper.

def _scale_enum(anchor, scales):
    """
    Enumerate a set of anchors for each scale wrt an anchor.
    """
    w, h, x_ctr, y_ctr = _whctrs(anchor)
    ws = w * scales
    hs = h * scales
    anchors = _mkanchors(ws, hs, x_ctr, y_ctr)
    return anchors

_scale_enum also first converts each ratio anchor into (width, height, center x, center y) form, multiplies the width and height by each scale, and converts back to four-coordinate form. The coordinates of the 9 anchors obtained after the aspect-ratio and scale transformations are:

array([[ -83.,  -39.,  100.,   56.],
      [-175.,  -87.,  192.,  104.],
      [-359., -183.,  376.,  200.],
      [ -55.,  -55.,   72.,   72.],
      [-119., -119.,  136.,  136.],
      [-247., -247.,  264.,  264.],
      [ -35.,  -79.,   52.,   96.],
      [ -79., -167.,   96.,  184.],
      [-167., -343.,  184.,  360.]])
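For completeness, the generate_anchors wrapper in generate_anchors.py ties _ratio_enum and _scale_enum together roughly as follows (a sketch following the reference code, assuming the helpers above are in scope); calling it reproduces the 9 anchors listed above:

```python
import numpy as np

def generate_anchors(base_size=16, ratios=[0.5, 1, 2], scales=2 ** np.arange(3, 6)):
    """One 16x16 base anchor, three aspect ratios, then three scales per ratio
    -> 9 anchors centred on the same point."""
    base_anchor = np.array([1, 1, base_size, base_size]) - 1      # [0, 0, 15, 15]
    ratio_anchors = _ratio_enum(base_anchor, ratios)              # 3 anchors, one per ratio
    anchors = np.vstack([_scale_enum(ratio_anchors[i, :], scales)
                         for i in range(ratio_anchors.shape[0])]) # 3 ratios x 3 scales = 9
    return anchors

print(generate_anchors().shape)   # (9, 4) -- the array shown above
```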

The 9 anchors are shown in the figure below:
[Figure: the 9 anchors for one feature-map point]
The above describes how 9 anchors are generated for a single point on the feature map. The same 9 anchors are generated for every point of the feature map, giving M × N × 9 anchors in total. (See snippets.py in the source code.)

def generate_anchors_pre(height,
                         width,
                         feat_stride,
                         anchor_scales=(8, 16, 32),
                         anchor_ratios=(0.5, 1, 2)):
    """ 
    A wrapper function to generate anchors given different scales
    Also return the number of anchors in variable 'length'
    """
    # generate the 9 base anchors with different aspect ratios and scales
    anchors = generate_anchors(
        ratios=np.array(anchor_ratios), scales=np.array(anchor_scales))
    # A = 9
    A = anchors.shape[0]
    # horizontal offsets (0, 16, 32, ...)
    shift_x = np.arange(0, width) * feat_stride
    # vertical offsets (0, 16, 32, ...)
    shift_y = np.arange(0, height) * feat_stride
    
    """
    After meshgrid:
    shift_x = [[0,16,32,..],[0,16,32,..],[0,16,32,..]...],
    shift_y = [[0,0,0,..],[16,16,16,..],[32,32,32,..]...],
    i.e. a grid of horizontal and vertical offsets, so every point of the
    feature map can be mapped back to its exact position in the original image.
    """
    shift_x, shift_y = np.meshgrid(shift_x, shift_y)
    """
    After this step shift_x and shift_y have the same shape. ravel() flattens
    each into one row, so both have height * width elements (one per
    feature-map pixel): shift_x holds the x offsets and shift_y the
    corresponding y offsets.
    """
    # shift_x.ravel(): (M×N,)  shift_y.ravel(): (M×N,)
    # transpose: (4, M×N) -> (M×N, 4)
    shifts = np.vstack((shift_x.ravel(), shift_y.ravel(), shift_x.ravel(),
                        shift_y.ravel())).transpose()

    # total number of feature-map positions
    K = shifts.shape[0]
    """
    width changes faster, so here it is H, W, C
    Add each of the 9 base anchor coordinates to every shift; the four columns
    are the top-left and bottom-right corners, both shifted at once. Sweeping
    over all K feature-map positions, each with A (= 9) base anchors, and
    reshaping to (K * A, 4) gives the corner coordinates of every anchor.
    """
    anchors = anchors.reshape((1, A, 4)) + shifts.reshape((1, K, 4)).transpose((1, 0, 2))
    anchors = anchors.reshape((K * A, 4)).astype(np.float32, copy=False)  
    length = np.int32(anchors.shape[0])

    return anchors, length

Since the feature map has size M × N, a total of M × N × 9 anchor boxes are generated.

[Summary: from the 9 base anchors to M × N × 9 anchors]
Using horizontal offsets (0..W-1) * 16 and vertical offsets (0..H-1) * 16 (e.g. 0-60 and 0-40 for a 60 × 40 feature map), a shift array is built and added to the base-anchor array, giving the anchor coordinates for every feature-map pixel as a [K * A, 4] array (a small usage sketch follows below).

The anchor-generation process above can be summarized as:
[Figure: anchor-generation pipeline]
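A tiny usage sketch of generate_anchors_pre for a hypothetical 40 × 60 feature map (assuming the functions above are defined or importable):

```python
anchors, length = generate_anchors_pre(height=40, width=60, feat_stride=16)
print(anchors.shape, length)    # (21600, 4) 21600  ->  40 * 60 * 9 anchors
print(anchors[0], anchors[9])   # the first base anchor at shift (0, 0), and the same anchor shifted 16 px right
```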

2.2.2 How the RPN Works

Caffe version of the network model structure:
[Figure: Caffe prototxt structure of the RPN]
Implementation of the RPN (see network.py in the source code):

# Example with feature map net_conv of shape [1, 1024, 29, 63]
def _region_proposal(self, net_conv):

    # ************* classification branch: foreground vs. background *************
    # feature map: net_conv
    # self.rpn_net: Conv2d(1024, 512, kernel_size=[3, 3], stride=(1, 1), padding=(1, 1))
    # rpn: [1, 1024, 29, 63] -> [1, 512, 29, 63]
    rpn = F.relu(self.rpn_net(net_conv))
    self._act_summaries['rpn'] = rpn

    # self.rpn_cls_score_net: Conv2d(512, 18, kernel_size=[1, 1], stride=(1, 1))
    # rpn_cls_score: [1, 18, 29, 63]
    rpn_cls_score = self.rpn_cls_score_net(rpn)  # batch * (num_anchors * 2) * h * w

    # change it so that the score has 2 as its channel size
    # rpn_cls_score_reshape: [1, 2, 9×29, 63]
    rpn_cls_score_reshape = rpn_cls_score.view(1, 2, -1, rpn_cls_score.size()[-1])  # batch * 2 * (num_anchors*h) * w
    # rpn_cls_prob_reshape: [1, 2, 9×29, 63]
    rpn_cls_prob_reshape = F.softmax(rpn_cls_score_reshape, dim=1)

    # Move channel to the last dimension, to fit the input of python functions
    # rpn_cls_prob: [1, 29, 63, 18]
    rpn_cls_prob = rpn_cls_prob_reshape.view_as(rpn_cls_score).permute(0, 2, 3, 1)  # batch * h * w * (num_anchors * 2)
    # rpn_cls_score: [1, 29, 63, 18]
    rpn_cls_score = rpn_cls_score.permute(0, 2, 3, 1)  # batch * h * w * (num_anchors * 2)
    # rpn_cls_score_reshape: [1, 9×29, 63, 2]
    rpn_cls_score_reshape = rpn_cls_score_reshape.permute(0, 2, 3, 1).contiguous()  # batch * (num_anchors*h) * w * 2
    # rpn_cls_pred: [16443]  batch*num_anchors*h*w
    rpn_cls_pred = torch.max(rpn_cls_score_reshape.view(-1, 2), 1)[1]


    # *********** regression branch: box coordinates ***********
    # self.rpn_bbox_pred_net: Conv2d(512, 36, kernel_size=[1, 1], stride=(1, 1))
    # rpn_bbox_pred: [1, 512, 29, 63] -> [1, 36, 29, 63]
    rpn_bbox_pred = self.rpn_bbox_pred_net(rpn)
    # rpn_bbox_pred: [1, 29, 63, 36]
    rpn_bbox_pred = rpn_bbox_pred.permute(0, 2, 3, 1).contiguous()  # batch * h * w * (num_anchors*4)

    if self._mode == 'TRAIN':
        # rois: [2000, 5], roi_scores: [2000, 1]
        # apply the RPN offsets to the anchors, then keep the top 2000 boxes
        # (and their scores) after sorting and NMS
        rois, roi_scores = self._proposal_layer(rpn_cls_prob, rpn_bbox_pred)  # rois, roi_scores are variables

        # rpn_labels: training labels for the anchors
        rpn_labels = self._anchor_target_layer(rpn_cls_score)
        rois, _ = self._proposal_target_layer(rois, roi_scores)
    else:
        if cfg.TEST.MODE == 'nms':
            rois, _ = self._proposal_layer(rpn_cls_prob, rpn_bbox_pred)
        elif cfg.TEST.MODE == 'top':
            rois, _ = self._proposal_top_layer(rpn_cls_prob, rpn_bbox_pred)
        else:
            raise NotImplementedError

    self._predictions["rpn_cls_score"] = rpn_cls_score
    self._predictions["rpn_cls_score_reshape"] = rpn_cls_score_reshape
    self._predictions["rpn_cls_prob"] = rpn_cls_prob
    self._predictions["rpn_cls_pred"] = rpn_cls_pred
    self._predictions["rpn_bbox_pred"] = rpn_bbox_pred
    self._predictions["rois"] = rois

    return rois

The RPN process above can be summarized as follows:
[Figure: RPN data flow]

2.2.2.1 rpn-data
layer {  
      name: 'rpn-data'  
      type: 'Python'  
      bottom: 'rpn_cls_score'   # only provides the height and width of the feature map
      bottom: 'gt_boxes'        # ground-truth boxes
      bottom: 'im_info'         # image size and scale, used to filter anchor boxes
      bottom: 'data'  
      top: 'rpn_labels'  
      top: 'rpn_bbox_targets'  
      top: 'rpn_bbox_inside_weights'  
      top: 'rpn_bbox_outside_weights'  
      python_param {  
        module: 'rpn.anchor_target_layer'  
        layer: 'AnchorTargetLayer'  
        param_str: "'feat_stride': 16 \n'scales': !!python/tuple [8, 16, 32]"  
      }  
    }

This layer takes the 9 anchor boxes generated at every pixel of the M × N feature map and filters and labels them according to the following rules (see anchor_target_layer.py in the source code):

1. Discard anchor boxes that extend beyond the image boundary;
2. The anchor box with the highest IoU with a ground-truth box is labeled positive, label = 1;
3. Any anchor box whose IoU with a ground-truth box is > 0.7 is labeled positive, label = 1;
4. Any anchor box whose IoU with every ground-truth box is < 0.3 is labeled negative, label = 0;
5. The remaining anchors are neither positive nor negative and do not take part in training, label = -1.

Besides labeling the anchor boxes, this layer also computes the offsets between each anchor box and its ground-truth box, and assigns the corresponding weights.

Let the ground-truth box have center (x*, y*) and size (w*, h*), and the anchor box have center (x_a, y_a) and size (w_a, h_a). The offsets (regression targets) are:

    Δx = (x* - x_a) / w_a    Δy = (y* - y_a) / h_a

    Δw = log(w* / w_a)       Δh = log(h* / h_a)

The RPN learns from these anchor-to-ground-truth offsets, so that its weights acquire the ability to predict bounding boxes (a sketch of this target computation follows below).
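A sketch of the target computation that bbox_transform (called from _compute_targets) performs; shown here with NumPy, and the name bbox_transform_sketch is illustrative:

```python
import numpy as np

def bbox_transform_sketch(ex_rois, gt_rois):
    """Compute (dx, dy, dw, dh) from anchors (ex_rois) to their matched
    ground-truth boxes (gt_rois); both are [N, 4] in (x1, y1, x2, y2) form."""
    ex_w = ex_rois[:, 2] - ex_rois[:, 0] + 1.0
    ex_h = ex_rois[:, 3] - ex_rois[:, 1] + 1.0
    ex_cx = ex_rois[:, 0] + 0.5 * ex_w
    ex_cy = ex_rois[:, 1] + 0.5 * ex_h

    gt_w = gt_rois[:, 2] - gt_rois[:, 0] + 1.0
    gt_h = gt_rois[:, 3] - gt_rois[:, 1] + 1.0
    gt_cx = gt_rois[:, 0] + 0.5 * gt_w
    gt_cy = gt_rois[:, 1] + 0.5 * gt_h

    dx = (gt_cx - ex_cx) / ex_w        # Δx
    dy = (gt_cy - ex_cy) / ex_h        # Δy
    dw = np.log(gt_w / ex_w)           # Δw
    dh = np.log(gt_h / ex_h)           # Δh
    return np.stack((dx, dy, dw, dh), axis=1)

anchor = np.array([[0., 0., 15., 15.]])
gt     = np.array([[4., 4., 27., 19.]])
print(bbox_transform_sketch(anchor, gt))   # [[0.5, 0.25, 0.405..., 0.]]
```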
The corresponding source code (anchor_target_layer.py) is:

def anchor_target_layer(rpn_cls_score, gt_boxes, im_info, _feat_stride,
                        all_anchors, num_anchors):
    
    # rpn_cls_score: [1, 29, 63, 18], gt_boxes: [x1, y1, x2, y2, cls], im_info: [H, W, scale]
    # _feat_stride: [16], all_anchors: [1×29×63×9, 4], num_anchors: 9
    
    """Same as the anchor target layer in original Fast/er RCNN """

    A = num_anchors  # every feature-map pixel has 9 anchors, num_anchors: 9
    total_anchors = all_anchors.shape[0]  # 29*63*9
    K = total_anchors / num_anchors  # 29*63

    # allow boxes to sit over the edge by a small amount
    _allowed_border = 0

    # map of shape (..., H, W)
    height, width = rpn_cls_score.shape[1:3]
    # height and width are 29 and 63 respectively

    # only keep anchors inside the image
    # i.e. return the indices of anchors whose four coordinates all lie inside the image
    inds_inside = np.where(
        (all_anchors[:, 0] >= -_allowed_border) &
        (all_anchors[:, 1] >= -_allowed_border) &
        (all_anchors[:, 2] < im_info[1] + _allowed_border) &  # width
        (all_anchors[:, 3] < im_info[0] + _allowed_border)  # height
    )[0]

    # keep only inside anchors
    # select the anchors inside the image according to the indices in inds_inside
    anchors = all_anchors[inds_inside, :]

    # label: 1 is positive, 0 is negative, -1 is dont care
    labels = np.empty((len(inds_inside), ), dtype=np.float32)
    labels.fill(-1)

    # overlaps between the anchors and the gt boxes
    # overlaps (ex, gt)
    # bbox_overlaps(anchors, gt_boxes) computes the IoU between every anchor and every ground-truth box
    # anchors (N, 4), gt_boxes (K, 4), return value (N, K)
    overlaps = bbox_overlaps(
        np.ascontiguousarray(anchors, dtype=np.float),     # make the arrays contiguous in memory for speed
        np.ascontiguousarray(gt_boxes, dtype=np.float))

    # for each anchor, the index of the gt box with the largest overlap, shape (N,)
    argmax_overlaps = overlaps.argmax(axis=1)

    # for each anchor, the value of that largest overlap, shape (N,)
    max_overlaps = overlaps[np.arange(len(inds_inside)), argmax_overlaps]

    # for each gt box, the index of the anchor with the largest IoU, shape (K,)
    gt_argmax_overlaps = overlaps.argmax(axis=0)
    # for each gt box, that largest IoU value, 1*K
    gt_max_overlaps = overlaps[gt_argmax_overlaps,
                               np.arange(overlaps.shape[1])]
    # indices of the anchors achieving the largest IoU with each of the K gt boxes,
    # i.e. which anchor each gt box overlaps most
    gt_argmax_overlaps = np.where(overlaps == gt_max_overlaps)[0]

    # cfg.TRAIN.RPN_CLOBBER_POSITIVES: if an anchor qualifies as both negative and positive, make it negative
    if not cfg.TRAIN.RPN_CLOBBER_POSITIVES:   # false
        # assign bg labels first so that positive labels can clobber them
        # first set the negatives
        # RPN_NEGATIVE_OVERLAP is 0.3: anchors whose max IoU is below 0.3 get label 0
        labels[max_overlaps < cfg.TRAIN.RPN_NEGATIVE_OVERLAP] = 0   # 0.3

    # fg label: for each gt, anchor with highest overlap
    # the anchor with the largest IoU for each gt box gets label 1
    labels[gt_argmax_overlaps] = 1

    # fg label: above threshold IOU
    # anchors whose IoU with a gt box is at least 0.7 get label 1
    labels[max_overlaps >= cfg.TRAIN.RPN_POSITIVE_OVERLAP] = 1  # 0.7

    if cfg.TRAIN.RPN_CLOBBER_POSITIVES:
        # assign bg labels last so that negative labels can clobber positives
        labels[max_overlaps < cfg.TRAIN.RPN_NEGATIVE_OVERLAP] = 0

    # subsample positive labels if we have too many
    # if there are too many positives, set the surplus to label -1 so they do not take part in training
    num_fg = int(cfg.TRAIN.RPN_FG_FRACTION * cfg.TRAIN.RPN_BATCHSIZE)  # at most 0.5 positives, total batch size 256
    fg_inds = np.where(labels == 1)[0]
    if len(fg_inds) > num_fg:
        disable_inds = npr.choice(
            fg_inds, size=(len(fg_inds) - num_fg), replace=False)
        labels[disable_inds] = -1

    # subsample negative labels if we have too many
    # likewise, if there are too many negatives, set the surplus to label -1
    num_bg = cfg.TRAIN.RPN_BATCHSIZE - np.sum(labels == 1)
    bg_inds = np.where(labels == 0)[0]
    if len(bg_inds) > num_bg:
        disable_inds = npr.choice(
            bg_inds, size=(len(bg_inds) - num_bg), replace=False)
        labels[disable_inds] = -1

    # bbox_targets = np.zeros((len(inds_inside), 4), dtype=np.float32)
    # anchors: N*4   gt_boxes: N*5
    # return the regression targets for each anchor, N*4 (dx, dy, dw, dh)
    # anchors here are all the anchors inside the image;
    # compute the offset between each such anchor and the gt box with the largest IoU
    bbox_targets = _compute_targets(anchors, gt_boxes[argmax_overlaps, :])

    # set the weights for each anchor inside the image
    bbox_inside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
    # only the positive ones have regression targets
    # anchors with label = 1 get [1, 1, 1, 1], all others get [0, 0, 0, 0]
    bbox_inside_weights[labels == 1, :] = np.array(
        cfg.TRAIN.RPN_BBOX_INSIDE_WEIGHTS)

    bbox_outside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
    # RPN_POSITIVE_WEIGHT defaults to -1
    # positive samples get weight p *= 1/{num positives}, negatives get (1-p)
    if cfg.TRAIN.RPN_POSITIVE_WEIGHT < 0:
        # uniform weighting of examples (given non-uniform sampling)
        # number of samples
        num_examples = np.sum(labels >= 0)
        # weight for positive samples
        positive_weights = np.ones((1, 4)) * 1.0 / num_examples
        # weight for negative samples
        negative_weights = np.ones((1, 4)) * 1.0 / num_examples
    else:
        # RPN_POSITIVE_WEIGHT must lie between 0 and 1
        assert ((cfg.TRAIN.RPN_POSITIVE_WEIGHT > 0) &
                (cfg.TRAIN.RPN_POSITIVE_WEIGHT < 1))
        positive_weights = (
            cfg.TRAIN.RPN_POSITIVE_WEIGHT / np.sum(labels == 1))
        negative_weights = (
            (1.0 - cfg.TRAIN.RPN_POSITIVE_WEIGHT) / np.sum(labels == 0))
    # assign the positive / negative weights to bbox_outside_weights
    bbox_outside_weights[labels == 1, :] = positive_weights
    bbox_outside_weights[labels == 0, :] = negative_weights

    # map up to original set of anchors
    # expand labels back to total_anchors; anchors never assigned a label get -1, size becomes (29*63*9)*1
    labels = _unmap(labels, total_anchors, inds_inside, fill=-1)
    # expand bbox_targets back to the size of total_anchors
    bbox_targets = _unmap(bbox_targets, total_anchors, inds_inside, fill=0)
    # expand bbox_inside_weights back to the size of total_anchors
    bbox_inside_weights = _unmap(
        bbox_inside_weights, total_anchors, inds_inside, fill=0)
    # expand bbox_outside_weights back to the size of total_anchors
    bbox_outside_weights = _unmap(
        bbox_outside_weights, total_anchors, inds_inside, fill=0)

    # labels
    # reshape labels to 1*1*(9*29)*63 and assign to rpn_labels
    labels = labels.reshape((1, height, width, A)).transpose(0, 3, 1, 2)
    labels = labels.reshape((1, 1, A * height, width))
    rpn_labels = labels

    # bbox_targets
    # bbox_targets becomes 1*29*63*36 and is assigned to rpn_bbox_targets
    bbox_targets = bbox_targets.reshape((1, height, width, A * 4))
    rpn_bbox_targets = bbox_targets

    # bbox_inside_weights
    # 1*29*63*36
    bbox_inside_weights = bbox_inside_weights.reshape((1, height, width, A * 4))

    rpn_bbox_inside_weights = bbox_inside_weights

    # bbox_outside_weights
    # shape [1, 29, 63, 36]
    bbox_outside_weights = bbox_outside_weights.reshape((1, height, width, A * 4))

    rpn_bbox_outside_weights = bbox_outside_weights
    return rpn_labels, rpn_bbox_targets, rpn_bbox_inside_weights, rpn_bbox_outside_weights

Outputs:
rpn_labels: labels for all anchors, [1, 1, 9×M, N]
rpn_bbox_targets: offsets [dx, dy, dw, dh] between every anchor and its ground-truth box, [1, M, N, 36]
rpn_bbox_inside_weights: weights for the positive-sample regression loss, default 1, [1, M, N, 36]
rpn_bbox_outside_weights: weights balancing the RPN classification and regression losses, [1, M, N, 36]

2.2.2.2 rpn_loss_cls, rpn_loss_bbox, rpn_cls_prob

After the steps above we have n anchor boxes, each with its own label.

rpn_loss_cls: the softmax loss used to classify the anchor boxes (foreground vs. background);
rpn_loss_bbox: the smooth L1 loss applied to the anchor-box coordinates, training the offsets Δx, Δy, Δw, Δh;
rpn_cls_prob: the class probabilities (used by the NMS in the next layer), which further reduce the number of anchor boxes.

The rpn-data layer has already labeled the anchor boxes and computed their offsets to the gt_boxes, so the RPN can now be trained on them.
During RPN training, a mini-batch consists of 256 proposals randomly sampled from one image, with a 1:1 ratio of positives to negatives. If there are fewer than 128 positives, more negatives are used so that the mini-batch still has 256 proposals, and vice versa.
The corresponding source code is:

# RPN, class loss
# rpn_cls_score:[1, 9×29, 63, 2] -> [1×29×63×9, 2]
rpn_cls_score = self._predictions['rpn_cls_score_reshape'].view(-1, 2)
# rpn_label:[1, 1, 9×29, 63] -> [1×1×9×29×63]
rpn_label = self._anchor_targets['rpn_labels'].view(-1)
# rpn_select:[256]
rpn_select = (rpn_label.data != -1).nonzero().view(-1)
# rpn_cls_score:[256, 2]
rpn_cls_score = rpn_cls_score.index_select(0, rpn_select).contiguous().view(-1, 2)
# rpn_label:[256]
rpn_label = rpn_label.index_select(0, rpn_select).contiguous().view(-1)
rpn_cross_entropy = F.cross_entropy(rpn_cls_score, rpn_label)
# RPN, bbox loss
# rpn_bbox_pred:[1, 29, 63, 36]
rpn_bbox_pred = self._predictions['rpn_bbox_pred']
# rpn_bbox_targets:[1, 29, 63, 36]
rpn_bbox_targets = self._anchor_targets['rpn_bbox_targets']
# rpn_bbox_inside_weights:[1, 29, 63, 36]
rpn_bbox_inside_weights = self._anchor_targets['rpn_bbox_inside_weights']
rpn_bbox_outside_weights = self._anchor_targets['rpn_bbox_outside_weights']

rpn_loss_box = self._smooth_l1_loss(
    rpn_bbox_pred,
    rpn_bbox_targets,
    rpn_bbox_inside_weights,
    rpn_bbox_outside_weights,
    sigma=sigma_rpn,
    dim=[1, 2, 3])
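The _smooth_l1_loss helper called above is not shown in the excerpt; a self-contained sketch of what it computes (the weighted smooth L1 described in section 2.5; the RPN term typically uses sigma = 3):

```python
import torch

def smooth_l1_loss(bbox_pred, bbox_targets, bbox_inside_weights,
                   bbox_outside_weights, sigma=3.0, dim=(1, 2, 3)):
    """Sketch of the weighted smooth-L1 regression loss."""
    sigma_2 = sigma ** 2
    box_diff = bbox_inside_weights * (bbox_pred - bbox_targets)   # zero out non-positive anchors
    abs_diff = torch.abs(box_diff)
    flag = (abs_diff < 1.0 / sigma_2).float()                     # quadratic vs. linear branch
    loss = flag * 0.5 * sigma_2 * box_diff ** 2 + (1.0 - flag) * (abs_diff - 0.5 / sigma_2)
    loss = bbox_outside_weights * loss                            # 1/N normalisation weights
    for d in sorted(dim, reverse=True):                           # sum over H, W, C
        loss = loss.sum(d)
    return loss.mean()

pred   = torch.randn(1, 29, 63, 36)
target = torch.zeros_like(pred)
w_in   = torch.ones_like(pred)
w_out  = torch.full_like(pred, 1.0 / 256)
print(smooth_l1_loss(pred, target, w_in, w_out).item())

```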
2.2.2.3 proposal
layer {  
      name: 'proposal'  
      type: 'Python'  
      bottom: 'rpn_cls_prob_reshape' 
      bottom: 'rpn_bbox_pred'  # the four trained regression values Δx, Δy, Δw, Δh
      bottom: 'im_info'  
      top: 'rpn_rois'  
      python_param {  
        module: 'rpn.proposal_layer'  
        layer: 'ProposalLayer'  
        param_str: "'feat_stride': 16 \n'scales': [8, 16, 32]"
      }  
    }

The corresponding source code (proposal_layer.py):

# Extract the required RoIs from the RPN output
def proposal_layer(rpn_cls_prob, rpn_bbox_pred, im_info, cfg_key, _feat_stride,
                   anchors, num_anchors):
    # rpn_cls_prob: [1, 29, 63, 18], rpn_bbox_pred: [1, 29, 63, 36], im_info: [460., 1000., 1.]
    # cfg_key: 'TRAIN', anchors: [16443, 4], num_anchors: 9
    """
     A simplified version compared to fast/er RCNN
     For details please see the technical report
    """
    if type(cfg_key) == bytes:
        cfg_key = cfg_key.decode('utf-8')
    # train: rpn_pre_nms_top_n: 12000, rpn_post_nms_top_n: 2000, rpn_nms_thresh: 0.7
    # test:  rpn_pre_nms_top_n: 6000,  rpn_post_nms_top_n: 300,  rpn_nms_thresh: 0.7
    pre_nms_topN = cfg[cfg_key].RPN_PRE_NMS_TOP_N
    post_nms_topN = cfg[cfg_key].RPN_POST_NMS_TOP_N
    nms_thresh = cfg[cfg_key].RPN_NMS_THRESH

    # Get the scores and bounding boxes
    # rpn_cls_prob has shape 1*29*63*18, num_anchors: 9
    # in the last dimension the first 9 values are the background probabilities
    # and the last 9 are the foreground probabilities
    scores = rpn_cls_prob[:, :, :, num_anchors:]
    # reshape rpn_bbox_pred to 2-D (29*63*9)*4; each row holds the 4 regression parameters of a candidate box
    rpn_bbox_pred = rpn_bbox_pred.view((-1, 4))
    # reshape scores to 2-D (29*63*9)*1; each row holds the objectness probability of a candidate box
    scores = scores.contiguous().view(-1, 1)

    # bbox_transform_inv: apply the RPN output to the initial anchors to get the first set of transformed proposals
    proposals = bbox_transform_inv(anchors, rpn_bbox_pred)
    # clip proposals that extend beyond the image boundary so they lie inside the image
    proposals = clip_boxes(proposals, im_info[:2])

    # Pick the top region proposals
    # sort the scores in descending order; returns the scores and their indices
    scores, order = scores.view(-1).sort(descending=True)
    if pre_nms_topN > 0:
        order = order[:pre_nms_topN]
        scores = scores[:pre_nms_topN].view(-1, 1)
    # keep the proposals with the highest objectness scores (second filtering step)
    proposals = proposals[order.data, :]

    # Non-maximal suppression
    # remove duplicate boxes with NMS (third filtering step); returns the kept indices
    keep = nms(proposals, scores.squeeze(1), nms_thresh)

    # Pick th top region proposals after NMS
    if post_nms_topN > 0:
        # keep only the boxes whose scores rank within post_nms_topN
        # (2000 during training, 300 at test time) -- the fourth filtering step
        keep = keep[:post_nms_topN]
    proposals = proposals[keep, :]
    scores = scores[keep, ]

    # Only support single image as input
    # for roi_pooling, prepend the image index within the batch to each box;
    # since batch_size is 1, the index is always 0
    batch_inds = proposals.new_zeros(proposals.size(0), 1)
    blob = torch.cat((batch_inds, proposals), 1)
    # blob: [2000, 5]  scores: [2000, 1]
    return blob, scores

Let's walk through what proposal_layer does:

1. proposal_layer selects the required RoIs from the RPN output. It runs both at training and at test time; only the number of RoIs kept differs, which is why the thresholds are read at the top of the function.

2. When selecting the foreground probabilities from the RPN output, the last 9 of the 18 channels are used. The 18 values are the background and foreground scores of the 9 base anchors, laid out as shown below (see also the small sketch after the figure):
[Figure: channel layout of the 18 RPN scores]
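A small sketch of why scores = rpn_cls_prob[:, :, :, num_anchors:] picks the foreground probabilities, given the view / softmax / permute sequence from _region_proposal (shapes here are toy values):

```python
import torch
import torch.nn.functional as F

H, W, A = 2, 3, 9
rpn_cls_score = torch.randn(1, 2 * A, H, W)                 # [1, 18, H, W]

# softmax pairs channel c (background) with channel c + 9 (foreground)
reshaped = rpn_cls_score.view(1, 2, -1, W)                  # [1, 2, 9H, W]
prob = F.softmax(reshaped, dim=1).view_as(rpn_cls_score)    # back to [1, 18, H, W]
prob = prob.permute(0, 2, 3, 1)                             # [1, H, W, 18]

# channels 0..8 are the background probabilities, 9..17 the foreground ones,
# which is why proposal_layer takes rpn_cls_prob[:, :, :, num_anchors:]
bg, fg = prob[:, :, :, :A], prob[:, :, :, A:]
print(torch.allclose(fg + bg, torch.ones_like(fg)))         # True: each bg/fg pair sums to 1
```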
3. bbox_transform_inv: apply the RPN output to update the initial anchor coordinates, producing the first set of transformed proposals.

def bbox_transform_inv(boxes, deltas):
    # combine the anchor boxes with rpn_bbox_pred (the offsets) to obtain G' (the predicted boxes)
    # Input should be both tensor or both Variable and on the same device
    if len(boxes) == 0:
        return deltas.detach() * 0

    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    dx = deltas[:, 0::4]
    dy = deltas[:, 1::4]
    dw = deltas[:, 2::4]
    dh = deltas[:, 3::4]

    pred_ctr_x = dx * widths.unsqueeze(1) + ctr_x.unsqueeze(1)
    pred_ctr_y = dy * heights.unsqueeze(1) + ctr_y.unsqueeze(1)
    pred_w = torch.exp(dw) * widths.unsqueeze(1)
    pred_h = torch.exp(dh) * heights.unsqueeze(1)

    pred_boxes = torch.cat( \
        [_.unsqueeze(2) for _ in [pred_ctr_x - 0.5 * pred_w, \
                                  pred_ctr_y - 0.5 * pred_h, \
                                  pred_ctr_x + 0.5 * pred_w, \
                                  pred_ctr_y + 0.5 * pred_h]], 2).view(len(boxes), -1)
    # pred_boxes:[1×29×63×9, 4]
    return pred_boxes

4. clip_boxes: clamp the proposal coordinates to the image boundaries.

def clip_boxes(boxes, im_shape):
    """
    Clip boxes to image boundaries.
    boxes must be tensor or Variable, im_shape can be anything but Variable
    """
    # hasattr() checks whether the object has the given attribute
    if not hasattr(boxes, 'data'):
        boxes_ = boxes.numpy()

    boxes = boxes.view(boxes.size(0), -1, 4)
    boxes = torch.stack( \
        [boxes[:, :, 0].clamp(0, im_shape[1] - 1),
         boxes[:, :, 1].clamp(0, im_shape[0] - 1),
         boxes[:, :, 2].clamp(0, im_shape[1] - 1),
         boxes[:, :, 3].clamp(0, im_shape[0] - 1)], 2).view(boxes.size(0), -1)

    return boxes

5. Sort the scores and keep the top pre_nms_topN anchors (12000 during training, 6000 at test time), then apply NMS for a further filtering step and keep only the top post_nms_topN anchors (2000 during training, 300 at test time);

6. Prepend the image index within the batch to each kept box; since batch_size is 1, the index is always 0. The final return values are blob: [2000, 5] and scores: [2000, 1].
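The nms() call above is a compiled extension in the repository; for illustration, torchvision's off-the-shelf operator performs the same greedy suppression (box-coordinate conventions may differ slightly):

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[  0.,   0., 100., 100.],
                      [  5.,   5., 105., 105.],   # heavy overlap with box 0
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.9, 0.8, 0.7])

keep = nms(boxes, scores, iou_threshold=0.7)      # indices of the boxes that survive
print(keep)                                       # tensor([0, 2]) -- box 1 is suppressed
```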

2.2.2.4 roi_data
layer {  
      name: 'roi-data'  
      type: 'Python'  
      bottom: 'rpn_rois'  
      bottom: 'gt_boxes'  
      top: 'rois'  
      top: 'labels'  
      top: 'bbox_targets'  
      top: 'bbox_inside_weights'  
      top: 'bbox_outside_weights'  
      python_param {  
        module: 'rpn.proposal_target_layer'  
        layer: 'ProposalTargetLayer'  
        param_str: "'num_classes': 81"  
      }  
    }

This part corresponds to proposal_target_layer.py in the source code.

def proposal_target_layer(rpn_rois, rpn_scores, gt_boxes, _num_classes):
    """
    Assign object detection proposals to ground-truth targets. Produces proposal
    classification labels and bounding-box regression targets.
    """

    # Proposal ROIs (0, x1, y1, x2, y2) coming from RPN
    # (i.e., rpn.proposal_layer.ProposalLayer), or any other source
    all_rois = rpn_rois
    all_scores = rpn_scores

    # Include ground-truth boxes in the set of candidate rois
    # whether to add the ground-truth boxes to the sampled RoIs; the config default is False
    if cfg.TRAIN.USE_GT:
        zeros = rpn_rois.new_zeros(gt_boxes.shape[0], 1)
        # all_rois: [g+r, 5]
        all_rois = torch.cat((all_rois, torch.cat(
            (zeros, gt_boxes[:, :-1]), 1)), 0)
        # torch.cat() with dim=0 concatenates along rows, with dim=1 along columns
        # not sure if it a wise appending, but anyway i am not using it
        all_scores = torch.cat((all_scores, zeros), 0)

    num_images = 1
    # batch size used during training
    rois_per_image = cfg.TRAIN.BATCH_SIZE / num_images    # 256
    fg_rois_per_image = int(round(cfg.TRAIN.FG_FRACTION * rois_per_image))  # 64

    # Sample rois with classification labels and bounding box regression
    # targets
    # select RoIs together with classification labels and box-regression targets
    labels, rois, roi_scores, bbox_targets, bbox_inside_weights = _sample_rois(
        all_rois, all_scores, gt_boxes, fg_rois_per_image, rois_per_image,
        _num_classes)

    rois = rois.view(-1, 5)
    roi_scores = roi_scores.view(-1)
    labels = labels.view(-1, 1)
    bbox_targets = bbox_targets.view(-1, _num_classes * 4)
    bbox_inside_weights = bbox_inside_weights.view(-1, _num_classes * 4)
    bbox_outside_weights = (bbox_inside_weights > 0).float()

    return rois, roi_scores, labels, bbox_targets, bbox_inside_weights, bbox_outside_weights

Let's walk through what proposal_target_layer does:

1. If cfg.TRAIN.USE_GT = True, the ground-truth boxes are added to the RoIs output by the RPN, which effectively increases the number of foreground samples; the number of RoIs then becomes N (RPN output boxes) + M (ground-truth boxes).

2. _sample_rois first computes the IoU between every RoI and every ground-truth box, then for each RoI finds the best-matching ground-truth box and the corresponding class label.

def _sample_rois(all_rois, all_scores, gt_boxes, fg_rois_per_image,
                 rois_per_image, num_classes):
    """
    Generate a random sample of RoIs comprising foreground and background examples.
    """
    # overlaps: (rois x gt_boxes)
    # IoU between the RoIs and the ground-truth boxes; input rois: N*4, gt_boxes: all targets in the image, K*4
    # return value: overlaps, the IoU between every RoI and every target box, shape [N, K]
    overlaps = bbox_overlaps(all_rois[:, 1:5].data, gt_boxes[:, :4].data)
    # for each RoI, the largest IoU with any ground-truth box and the index of that box
    max_overlaps, gt_assignment = overlaps.max(1)
    # the class label of the gt box with the largest IoU for each RoI, shape [N]
    labels = gt_boxes[gt_assignment, [4]]

    # Select foreground RoIs as those with >= FG_THRESH overlap
    # fg_thresh = 0.5; RoIs with IoU >= 0.5 are treated as containing an object
    fg_inds = (max_overlaps >= cfg.TRAIN.FG_THRESH).nonzero().view(-1)
    # Guard against the case when an image has fewer than fg_rois_per_image
    # Select background RoIs as those within [BG_THRESH_LO, BG_THRESH_HI)
    # bg_thresh_hi = 0.5, bg_thresh_lo = 0.1; RoIs with IoU in this range are treated as background
    bg_inds = ((max_overlaps < cfg.TRAIN.BG_THRESH_HI) + (
            max_overlaps >= cfg.TRAIN.BG_THRESH_LO) == 2).nonzero().view(-1)

    # Small modification to the original version where we ensure a fixed number of regions are sampled
    # sample a fixed number of candidate regions
    if fg_inds.numel() > 0 and bg_inds.numel() > 0:   # numel() returns the number of elements, i.e. the fg / bg counts
        fg_rois_per_image = min(fg_rois_per_image, fg_inds.numel())  # number of foreground RoIs per image
        fg_inds = fg_inds[torch.from_numpy(
            npr.choice(
                np.arange(0, fg_inds.numel()),
                size=int(fg_rois_per_image),
                replace=False)).long().to(gt_boxes.device)]

        bg_rois_per_image = rois_per_image - fg_rois_per_image  # number of background RoIs
        to_replace = bg_inds.numel() < bg_rois_per_image  # True or False
        bg_inds = bg_inds[torch.from_numpy(
            npr.choice(
                np.arange(0, bg_inds.numel()),
                size=int(bg_rois_per_image),
                replace=to_replace)).long().to(gt_boxes.device)]

    elif fg_inds.numel() > 0:
        to_replace = fg_inds.numel() < rois_per_image
        fg_inds = fg_inds[torch.from_numpy(
            npr.choice(
                np.arange(0, fg_inds.numel()),
                size=int(rois_per_image),
                replace=to_replace)).long().to(gt_boxes.device)]
        fg_rois_per_image = rois_per_image

    elif bg_inds.numel() > 0:
        to_replace = bg_inds.numel() < rois_per_image
        bg_inds = bg_inds[torch.from_numpy(
            npr.choice(
                np.arange(0, bg_inds.numel()),
                size=int(rois_per_image),
                replace=to_replace)).long().to(gt_boxes.device)]
        fg_rois_per_image = 0
    else:
        import pdb
        pdb.set_trace()

    # The indices that we're selecting (both fg and bg)
    keep_inds = torch.cat([fg_inds, bg_inds], 0)  # indices of the boxes that are finally kept
    # Select sampled values from various arrays:
    labels = labels[keep_inds].contiguous()
    # Clamp labels for the background RoIs to 0
    labels[int(fg_rois_per_image):] = 0

    rois = all_rois[keep_inds].contiguous()
    roi_scores = all_scores[keep_inds].contiguous()

    bbox_target_data = _compute_targets(
        rois[:, 1:5].data, gt_boxes[gt_assignment[keep_inds]][:, :4].data,
        labels.data)   # offsets between the RoIs and their gt boxes, [256, 5]

    bbox_targets, bbox_inside_weights = \
        _get_bbox_regression_labels(bbox_target_data, num_classes)

    return labels, rois, roi_scores, bbox_targets, bbox_inside_weights

3. For one training batch, foreground and background boxes are sampled from all RoIs (foreground boxes may make up at most 1/4 of the batch).

4. The sampled boxes are assigned classification labels, and their box-regression targets are computed by _compute_targets.

def _compute_targets(ex_rois, gt_rois, labels):
    """Compute bounding-box regression targets for an image."""
    # Inputs are tensor

    assert ex_rois.shape[0] == gt_rois.shape[0]
    assert ex_rois.shape[1] == 4
    assert gt_rois.shape[1] == 4

    targets = bbox_transform(ex_rois, gt_rois)  # offsets between the gt boxes and the RoI boxes
    if cfg.TRAIN.BBOX_NORMALIZE_TARGETS_PRECOMPUTED:
        # Optionally normalize targets by a precomputed mean and stdev
        targets = ((targets - targets.new(cfg.TRAIN.BBOX_NORMALIZE_MEANS)) /
                   targets.new(cfg.TRAIN.BBOX_NORMALIZE_STDS))
    return torch.cat([labels.unsqueeze(1), targets], 1)

5. _get_bbox_regression_labels expands the regression targets into the format required for training.

def _get_bbox_regression_labels(bbox_target_data, num_classes):
    """
    Bounding-box regression targets (bbox_target_data) are stored in a
    compact form N x (class, tx, ty, tw, th)

    This function expands those targets into the 4-of-4*K representation used
    by the network (i.e. only one class has non-zero targets).

    Returns:
      bbox_target (ndarray): N x 4K blob of regression targets
      bbox_inside_weights (ndarray): N x 4K blob of loss weights
    """

    clss = bbox_target_data[:, 0]  # class of each RoI used for training

    # initialise the regression targets to 0: for each RoI, 4 values per class
    bbox_targets = clss.new_zeros(clss.numel(), 4 * num_classes)
    bbox_inside_weights = clss.new_zeros(bbox_targets.shape)   # bbox_inside_weights also initialised to 0 for each RoI
    inds = (clss > 0).nonzero().view(-1)  # indices of the foreground RoIs

    if inds.numel() > 0:
        clss = clss[inds].contiguous().view(-1, 1)
        dim1_inds = inds.unsqueeze(1).expand(inds.size(0), 4)
        dim2_inds = torch.cat(
            [4 * clss, 4 * clss + 1, 4 * clss + 2, 4 * clss + 3], 1).long()
        # write the 4 regression values into the columns of the RoI's own class
        bbox_targets[dim1_inds, dim2_inds] = bbox_target_data[inds][:, 1:]
        bbox_inside_weights[dim1_inds, dim2_inds] = bbox_targets.new(
            cfg.TRAIN.BBOX_INSIDE_WEIGHTS).view(-1, 4).expand_as(dim1_inds)

    return bbox_targets, bbox_inside_weights

Main purposes:
(1) The RPN only decides whether a region proposal is an object (yes/no); here each proposal is assigned a concrete class label according to its maximum overlap with the ground-truth boxes;
(2) The offsets between the region proposals and the ground-truth boxes are computed, using the same offset formulas as before.

The resulting data is fed into the RoI Pooling layer for further classification and localization.

2.3 ROI Pooling

layer {  
      name: "roi_pool5"  
      type: "ROIPooling"  
      bottom: "conv5_3"   # input feature map
      bottom: "rois"      # input region proposals
      top: "pool5"        # output fixed-size feature map
      roi_pooling_param {  
      pooled_w: 7  
      pooled_h: 7  
      spatial_scale: 0.0625 # 1/16  
     }  
    }

RoI Pooling takes the convolutional feature map and the region proposals from the RPN as input, and outputs a feature map of the same fixed n × n size for every proposal.

2.3.1 ROI Pooling

In Faster R-CNN, RoI Pooling maps each region proposal onto the feature map and produces a fixed-size output. It works as follows:
[Figure: RoI Pooling quantization example]
1. Suppose the conv layers downsample the image by a total factor of 32 (feat_stride = 32). With an 800×800 input image, the final feature map is 25×25.

2. Suppose the image contains a region proposal of size 665×665. Mapped onto the feature map it becomes 665/32 = 20.78, i.e. 20.78×20.78. The Caffe C++ implementation of RoI Pooling rounds this value, so a first quantization occurs and the mapped region becomes 20×20.

3. Suppose pooled_w = 7 and pooled_h = 7, i.e. the output is a fixed 7×7 feature map. The 20×20 region mapped onto the feature map is divided into 49 equal bins, each of size 20/7 = 2.86, i.e. 2.86×2.86. A second quantization occurs here, so each bin becomes 2×2.

4. Within each 2×2 bin, the maximum pixel value is taken as the "representative" of that bin; the 49 bins thus produce 49 values, which form the 7×7 output feature map.

To summarize: after these two quantizations (rounding floating-point values to integers), the region proposal originally mapped to a 20.78×20.78 area on the feature map is pooled from a misaligned region. This pixel-level misalignment inevitably degrades the localization accuracy of the later regression layers,
which is why the alternative, RoI Align, was proposed.

2.3.2 ROI Align

RoI Align was proposed in Mask R-CNN as the operator that maps region proposals to fixed-size feature maps.
[Figure: RoI Align sampling example]
The mapping for the same example is analogous:
1. Suppose again that the conv layers downsample the image by a total factor of 32 (feat_stride = 32). With an 800×800 input image, the final feature map is 25×25.

2. Suppose the image contains a region proposal of size 665×665. Mapped onto the feature map it becomes 665/32 = 20.78, i.e. 20.78×20.78; unlike RoI Pooling, no rounding is performed and the floating-point value is kept.

3. Suppose pooled_w = 7 and pooled_h = 7, i.e. the output is a fixed 7×7 feature map. The 20.78×20.78 region mapped onto the feature map is divided into 49 equal bins, each of size 20.78/7 = 2.97, i.e. 2.97×2.97.

4. Suppose the number of sampling points is 4: each 2.97×2.97 bin is split into four equal parts, and the value at the center of each part is computed by bilinear interpolation, giving four values per bin, as shown below.
[Figure: bilinear interpolation of the four sampling points]
In the figure, the values at the four red crosses are obtained by bilinear interpolation.
Finally, the maximum of the four values is taken as the value of that bin (the 2.97×2.97 region). As before, the 49 bins produce 49 values, forming the 7×7 output feature map.

[Summary]
Given how RoI Pooling and RoI Align work, the difference between the two is small when detecting large objects, but when an image contains many small objects to detect, RoI Align is the better choice because it is more precise.
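Both operators are available off the shelf in torchvision; a usage sketch with the numbers from the example above (a 25×25 feature map at 1/32 scale, one 665×665 proposal, 7×7 output):

```python
import torch
from torchvision.ops import roi_pool, roi_align

feat = torch.randn(1, 512, 25, 25)                  # backbone feature map
rois = torch.tensor([[0., 0., 0., 665., 665.]])     # (batch_idx, x1, y1, x2, y2) in image coordinates

pooled  = roi_pool(feat,  rois, output_size=(7, 7), spatial_scale=1.0 / 32)
aligned = roi_align(feat, rois, output_size=(7, 7), spatial_scale=1.0 / 32,
                    sampling_ratio=4)               # 4 bilinear sampling points per bin
print(pooled.shape, aligned.shape)                  # both [1, 512, 7, 7]
```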

2.4 Fully Connected Layers

After the RoI Pooling layer, the pooled features go through fully connected layers (see the figure below); finally, a softmax loss and a smooth L1 loss again handle classification and localization.
[Figure: classification and regression head]
The fully connected layers plus softmax compute which class each region proposal belongs to and output the probability vector cls_prob; at the same time, another bounding-box regression produces the position offsets bbox_pred for each region proposal, which are used to regress a more accurate detection box.
The corresponding source code is:

# cls_num:8
def _region_classification(self, fc7):
    # self.cls_score_net:Linear(in_features=2048, out_features=8, bias=True)
    # cls_score:[256, 8]
    cls_score = self.cls_score_net(fc7)
    # cls_pred:[256]
    cls_pred = torch.max(cls_score, 1)[1]
    # cls_prob:[256, 8]
    cls_prob = F.softmax(cls_score, dim=1)

    # self.bbox_pred_net:Linear(in_features=2048, out_features=32, bias=True)
    # bbox_pred:[256, 32]
    bbox_pred = self.bbox_pred_net(fc7)

    self._predictions["cls_score"] = cls_score
    self._predictions["cls_pred"] = cls_pred
    self._predictions["cls_prob"] = cls_prob
    self._predictions["bbox_pred"] = bbox_pred

    return cls_prob, bbox_pred

In other words, after obtaining the 7×7 proposal feature maps from RoI Pooling, the fully connected layers mainly do two things:

1) classify the region proposals into concrete classes via the fully connected layers and softmax;

2) perform bounding-box regression on the region proposals once more, to obtain higher-precision boxes.

2.5 Loss Function

The loss used to train the RPN is:

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^{*}) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^{*}\,L_{reg}(t_i, t_i^{*})$$

Here $i$ is the anchor index, $p_i$ is the predicted (softmax) probability that anchor $i$ is an object, and $p_i^{*}$ is the ground-truth label: when the IoU between anchor $i$ and a ground-truth box is > 0.7 the anchor is considered positive and $p_i^{*} = 1$; when the IoU is < 0.3 it is considered negative and $p_i^{*} = 0$; anchors with 0.3 < IoU < 0.7 do not take part in training. $t_i$ is the predicted bounding-box regression vector and $t_i^{*}$ is the regression target of the ground-truth box matched to a positive anchor. The loss therefore has two parts:

1. The classification loss, computed by the rpn_cls_loss layer as a softmax loss, which trains the network to classify anchors as positive or negative;
2. The regression loss, computed by the rpn_loss_bbox layer as a smooth L1 loss, which trains the bounding-box regression. Note that this term is multiplied by $p_i^{*}$, so only positive anchors contribute to the regression (there is indeed no reason to regress negative anchors).

Since $N_{cls}$ and $N_{reg}$ differ greatly in practice, the parameter $\lambda$ balances the two terms (for example, with $N_{cls} = 256$ and $N_{reg} \approx 2400$, $\lambda = \frac{N_{reg}}{N_{cls}} \approx 10$), so that both losses carry roughly equal weight in the total loss. The important part here is the smooth L1 loss used by $L_{reg}$:

$$L_{reg}(t_i, t_i^{*}) = \sum_{j \in \{x, y, w, h\}} \text{smooth}_{L1}(t_i^{j} - t_i^{*j}), \qquad \text{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$

A few implementation notes:

1. During RPN training, the rpn-data layer (python AnchorTargetLayer) generates anchors in exactly the same way as the Proposal layer does at test time;

2. For rpn_loss_cls, the inputs rpn_cls_score_reshape and rpn_labels correspond to $p$ and $p^{*}$ respectively; $N_{cls}$ is implicit in the sizes of the corresponding caffe blobs;

3. For rpn_loss_bbox, the inputs rpn_bbox_pred and rpn_bbox_targets correspond to $t$ and $t^{*}$, rpn_bbox_inside_weights corresponds to $p^{*}$, and rpn_bbox_outside_weights is not actually used (as can be seen in the smooth_L1_Loss layer code); $N_{reg}$ is likewise implicit in the blob sizes.
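To tie the pieces together, here is a hedged sketch of how the four terms (RPN classification, RPN regression, R-CNN classification, R-CNN regression) are summed into one training loss; all tensors below are dummies with the shapes used in this article, and the inside/outside weighting described above is omitted for brevity:

```python
import torch
import torch.nn.functional as F

# RPN classification: 256 sampled anchors, fg/bg
rpn_cls_score = torch.randn(256, 2, requires_grad=True)
rpn_label     = torch.randint(0, 2, (256,))
rpn_loss_cls  = F.cross_entropy(rpn_cls_score, rpn_label)

# RPN regression: per-anchor (dx, dy, dw, dh)
rpn_bbox_pred    = torch.randn(1, 29, 63, 36, requires_grad=True)
rpn_bbox_targets = torch.randn(1, 29, 63, 36)
rpn_loss_bbox = F.smooth_l1_loss(rpn_bbox_pred, rpn_bbox_targets)

# R-CNN head: 256 RoIs, 8 classes, 4 offsets per class
cls_score = torch.randn(256, 8, requires_grad=True)
label     = torch.randint(0, 8, (256,))
loss_cls  = F.cross_entropy(cls_score, label)

bbox_pred    = torch.randn(256, 32, requires_grad=True)
bbox_targets = torch.randn(256, 32)
loss_bbox = F.smooth_l1_loss(bbox_pred, bbox_targets)

total_loss = rpn_loss_cls + rpn_loss_bbox + loss_cls + loss_bbox
total_loss.backward()
print(total_loss.item())
```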

Copyright notice: this is an original article by CSDN blogger "Dear_learner", licensed under CC 4.0 BY-SA; please include the original link and this notice when reposting.
Original link: https://blog.csdn.net/Dear_learner/article/details/122579463
