Faster RCNN Official Source Code Walkthrough
Overview
This article walks through the FasterRCNN implementation in torchvision in detail, covering the points the author considers important, the GeneralizedRCNN code, FasterRCNN training, and more, to help newcomers better understand the model's details.
PyTorch now ships FasterRCNN and MaskRCNN implementations in its torchvision module. To help readers understand the model's details, this article analyzes the FasterRCNN code and walks newcomers through the main problems in Two-Stage detection.
This article assumes some familiarity with how Faster R-CNN works; if you are new to it, please read an introductory article on the original paper first.
The documentation for the torchvision FasterRCNN code is here:
https://pytorch.org/docs/stable/torchvision/models.html#faster-r-cnn
After installing torchvision in Python, the following commands show the version and the code location:
import torchvision
print(torchvision.__version__)
# '0.6.0'
print(torchvision.__path__)
# ['/usr/local/lib/python3.7/site-packages/torchvision']

Code Structure

Figure 1

As the base class for object detection in torchvision, GeneralizedRCNN inherits from torch.nn.Module; FasterRCNN and MaskRCNN in turn both inherit from GeneralizedRCNN.
GeneralizedRCNN
Let's start with the code of the base class GeneralizedRCNN, which inherits from nn.Module:
class GeneralizedRCNN(nn.Module):
    def __init__(self, backbone, rpn, roi_heads, transform):
        super(GeneralizedRCNN, self).__init__()
        self.transform = transform
        self.backbone = backbone
        self.rpn = rpn
        self.roi_heads = roi_heads
        # used only on torchscript mode
        self._has_warned = False

    @torch.jit.unused
    def eager_outputs(self, losses, detections):
        # type: (Dict[str, Tensor], List[Dict[str, Tensor]]) -> Tuple[Dict[str, Tensor], List[Dict[str, Tensor]]]
        if self.training:
            return losses
        return detections

    def forward(self, images, targets=None):
        if self.training and targets is None:
            raise ValueError("In training mode, targets should be passed")
        original_image_sizes = torch.jit.annotate(List[Tuple[int, int]], [])
        for img in images:
            val = img.shape[-2:]
            assert len(val) == 2
            original_image_sizes.append((val[0], val[1]))
        images, targets = self.transform(images, targets)
        features = self.backbone(images.tensors)
        if isinstance(features, torch.Tensor):
            features = OrderedDict([('0', features)])
        proposals, proposal_losses = self.rpn(images, features, targets)
        detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
        detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)
        losses = {}
        losses.update(detector_losses)
        losses.update(proposal_losses)
        if torch.jit.is_scripting():
            if not self._has_warned:
                warnings.warn("RCNN always returns a (Losses, Detections) tuple in scripting")
                self._has_warned = True
            return (losses, detections)
        else:
            return self.eager_outputs(losses, detections)

The GeneralizedRCNN class has four important components:
transform, backbone, rpn, roi_heads
transform
# GeneralizedRCNN.forward(...)
for img in images:
    val = img.shape[-2:]
    assert len(val) == 2
    original_image_sizes.append((val[0], val[1]))
images, targets = self.transform(images, targets)

Figure 2: the transform interface
transform does two main things:
- normalizes the input (FasterRCNN subtracts image_mean from the input and divides by image_std)
- resizes the image to a fixed size range (scaling the boxes in targets accordingly)

Note that since the resized image is what goes through the network, the detection boxes the network outputs live in resized-image coordinates. In practice, however, we want boxes on the original image, so the pre-transform sizes are recorded in original_image_sizes to map them back.

Figure 3

A word on why images are resized at all. In pure theory, FasterRCNN can handle inputs of any size. In practice, however, a very large input (say 6000x4000) will simply exhaust memory, so from an engineering standpoint resizing is a safe compromise.
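For intuition, the resize rule in GeneralizedRCNNTransform keeps the aspect ratio: it scales the shorter side to min_size, unless that would push the longer side past max_size. A minimal sketch of that rule (the helper name compute_scale_factor is mine, not torchvision's):

def compute_scale_factor(h, w, min_size=800, max_size=1333):
    # scale so the shorter side becomes min_size ...
    scale = min_size / min(h, w)
    # ... unless that would push the longer side past max_size
    if scale * max(h, w) > max_size:
        scale = max_size / max(h, w)
    return scale

print(compute_scale_factor(4000, 6000))  # 0.2 -> a 6000x4000 photo becomes 1200x800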
backbone + rpn + roi_heads

Figure 4
Only after the resizing does the network pipeline proper begin. There are four steps:
1. Feed the transformed images to the backbone module to extract feature maps
# GeneralizedRCNN.forward(...)
features = self.backbone(images.tensors)
The backbone is usually a network such as VGG, ResNet, or MobileNet.
2. The rpn module then generates proposals and proposal_losses
# GeneralizedRCNN.forward(...)
proposals, proposal_losses = self.rpn(images, features, targets)
3. Next comes the roi_heads module (i.e. roi_pooling plus classification)
# GeneralizedRCNN.forward(...)
detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
4. Finally, the postprocess step applies NMS and maps the boxes back onto the original images via original_image_sizes
# GeneralizedRCNN.forward(...)
detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)
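Putting these four steps together, the whole pipeline can be exercised end to end with a few lines (this mirrors the usage example in the torchvision documentation):

import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()  # in eval mode, forward() returns detections rather than losses

# two images of different sizes; transform resizes and batches them internally
images = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
with torch.no_grad():
    detections = model(images)

# one dict per image, with boxes already mapped back to original image coordinates
print(detections[0].keys())  # dict_keys(['boxes', 'labels', 'scores'])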
FasterRCNN
FasterRCNN inherits from the base class GeneralizedRCNN.
class FasterRCNN(GeneralizedRCNN):
    def __init__(self, backbone, num_classes=None,
                 # transform parameters
                 min_size=800, max_size=1333,
                 image_mean=None, image_std=None,
                 # RPN parameters
                 rpn_anchor_generator=None, rpn_head=None,
                 rpn_pre_nms_top_n_train=2000, rpn_pre_nms_top_n_test=1000,
                 rpn_post_nms_top_n_train=2000, rpn_post_nms_top_n_test=1000,
                 rpn_nms_thresh=0.7,
                 rpn_fg_iou_thresh=0.7, rpn_bg_iou_thresh=0.3,
                 rpn_batch_size_per_image=256, rpn_positive_fraction=0.5,
                 # Box parameters
                 box_roi_pool=None, box_head=None, box_predictor=None,
                 box_score_thresh=0.05, box_nms_thresh=0.5, box_detections_per_img=100,
                 box_fg_iou_thresh=0.5, box_bg_iou_thresh=0.5,
                 box_batch_size_per_image=512, box_positive_fraction=0.25,
                 bbox_reg_weights=None):
        out_channels = backbone.out_channels
        if rpn_anchor_generator is None:
            anchor_sizes = ((32,), (64,), (128,), (256,), (512,))
            aspect_ratios = ((0.5, 1.0, 2.0),) * len(anchor_sizes)
            rpn_anchor_generator = AnchorGenerator(
                anchor_sizes, aspect_ratios
            )
        if rpn_head is None:
            rpn_head = RPNHead(
                out_channels, rpn_anchor_generator.num_anchors_per_location()[0]
            )
        rpn_pre_nms_top_n = dict(training=rpn_pre_nms_top_n_train, testing=rpn_pre_nms_top_n_test)
        rpn_post_nms_top_n = dict(training=rpn_post_nms_top_n_train, testing=rpn_post_nms_top_n_test)
        rpn = RegionProposalNetwork(
            rpn_anchor_generator, rpn_head,
            rpn_fg_iou_thresh, rpn_bg_iou_thresh,
            rpn_batch_size_per_image, rpn_positive_fraction,
            rpn_pre_nms_top_n, rpn_post_nms_top_n, rpn_nms_thresh)
        if box_roi_pool is None:
            box_roi_pool = MultiScaleRoIAlign(
                featmap_names=['0', '1', '2', '3'],
                output_size=7,
                sampling_ratio=2)
        if box_head is None:
            resolution = box_roi_pool.output_size[0]
            representation_size = 1024
            box_head = TwoMLPHead(
                out_channels * resolution ** 2,
                representation_size)
        if box_predictor is None:
            representation_size = 1024
            box_predictor = FastRCNNPredictor(
                representation_size,
                num_classes)
        roi_heads = RoIHeads(
            # Box
            box_roi_pool, box_head, box_predictor,
            box_fg_iou_thresh, box_bg_iou_thresh,
            box_batch_size_per_image, box_positive_fraction,
            bbox_reg_weights,
            box_score_thresh, box_nms_thresh, box_detections_per_img)
        if image_mean is None:
            image_mean = [0.485, 0.456, 0.406]
        if image_std is None:
            image_std = [0.229, 0.224, 0.225]
        transform = GeneralizedRCNNTransform(min_size, max_size, image_mean, image_std)
        super(FasterRCNN, self).__init__(backbone, rpn, roi_heads, transform)

FasterRCNN supplies concrete implementations of the transform, backbone, rpn, and roi_heads components of GeneralizedRCNN:
# FasterRCNN.__init__(...)
super(FasterRCNN, self).__init__(backbone, rpn, roi_heads, transform)
The transform component is implemented by GeneralizedRCNNTransform. Its parameters are clear from the variable names:
- resizing parameters: min_size + max_size
- normalization parameters: image_mean + image_std (the [0, 1] input has image_mean subtracted and is then divided by image_std)
# FasterRCNN.__init__(...)
if image_mean is None:
    image_mean = [0.485, 0.456, 0.406]
if image_std is None:
    image_std = [0.229, 0.224, 0.225]
transform = GeneralizedRCNNTransform(min_size, max_size, image_mean, image_std)

The backbone uses a ResNet50 + FPN structure:
def fasterrcnn_resnet50_fpn(pretrained=False, progress=True, num_classes=91, pretrained_backbone=True, **kwargs):
    if pretrained:
        # no need to download the backbone if pretrained is set
        pretrained_backbone = False
    backbone = resnet_fpn_backbone('resnet50', pretrained_backbone)
    model = FasterRCNN(backbone, num_classes, **kwargs)
    if pretrained:
        state_dict = load_state_dict_from_url(model_urls['fasterrcnn_resnet50_fpn_coco'], progress=progress)
        model.load_state_dict(state_dict)
    return model

ResNet: Deep Residual Learning for Image Recognition
FPN: Feature Pyramid Networks for Object Detection

Figure 5: FPN
Next we focus on the implementation of the rpn component. First, rpn_anchor_generator:
# FasterRCNN.__init__(...)
if rpn_anchor_generator is None:
    anchor_sizes = ((32,), (64,), (128,), (256,), (512,))
    aspect_ratios = ((0.5, 1.0, 2.0),) * len(anchor_sizes)
    rpn_anchor_generator = AnchorGenerator(
        anchor_sizes, aspect_ratios
    )
A plain FasterRCNN only needs to feed one feature_map into the rpn network to generate proposals. With FPN, however, several feature_maps must each be fed through the rpn.

Figure 6

Now let's look at the concrete implementation of AnchorGenerator:
class AnchorGenerator(nn.Module):
    ......
    def generate_anchors(self, scales, aspect_ratios, dtype=torch.float32, device="cpu"):
        # type: (List[int], List[float], int, Device) # noqa: F821
        scales = torch.as_tensor(scales, dtype=dtype, device=device)
        aspect_ratios = torch.as_tensor(aspect_ratios, dtype=dtype, device=device)
        h_ratios = torch.sqrt(aspect_ratios)
        w_ratios = 1 / h_ratios
        ws = (w_ratios[:, None] * scales[None, :]).view(-1)
        hs = (h_ratios[:, None] * scales[None, :]).view(-1)
        base_anchors = torch.stack([-ws, -hs, ws, hs], dim=1) / 2
        return base_anchors.round()

    def set_cell_anchors(self, dtype, device):
        # type: (int, Device) -> None # noqa: F821
        ......
        cell_anchors = [
            self.generate_anchors(
                sizes,
                aspect_ratios,
                dtype,
                device
            )
            for sizes, aspect_ratios in zip(self.sizes, self.aspect_ratios)
        ]
        self.cell_anchors = cell_anchors
First the base anchors: the 5 anchor_sizes and 3 aspect_ratios yield 15 base_anchors (with FPN, each of the 5 feature levels gets the 3 anchors of one size):
[ -23., -11., 23., 11.]
[ -16., -16., 16., 16.] # w = h = 32, ratio = 1
[ -11., -23., 11., 23.]
[ -45., -23., 45., 23.]
[ -32., -32., 32., 32.] # w = h = 64, ratio = 1
[ -23., -45., 23., 45.]
[ -91., -45., 91., 45.]
[ -64., -64., 64., 64.] # w = h = 128, ratio = 1
[ -45., -91., 45., 91.]
[-181., -91., 181., 91.]
[-128., -128., 128., 128.] # w = h = 256, ratio = 1
[ -91., -181., 91., 181.]
[-362., -181., 362., 181.]
[-256., -256., 256., 256.] # w = h = 512, ratio = 1
[-181., -362., 181., 362.]
Note that all base_anchors are centered at the point $(0, 0)$, as shown below:

Figure 7: base_anchors (only the 32/64/128 base_anchors are drawn)
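The 15 rows above can be reproduced outside the class by repeating the generate_anchors arithmetic; a self-contained sketch:

import torch

def make_base_anchors(scales, aspect_ratios):
    # same math as AnchorGenerator.generate_anchors: zero-centered (x1, y1, x2, y2)
    # boxes of area ~scale**2 whose sides are stretched by sqrt(aspect_ratio)
    scales = torch.as_tensor(scales, dtype=torch.float32)
    ratios = torch.as_tensor(aspect_ratios, dtype=torch.float32)
    h_ratios = torch.sqrt(ratios)
    w_ratios = 1 / h_ratios
    ws = (w_ratios[:, None] * scales[None, :]).view(-1)
    hs = (h_ratios[:, None] * scales[None, :]).view(-1)
    return (torch.stack([-ws, -hs, ws, hs], dim=1) / 2).round()

for size in (32, 64, 128, 256, 512):
    print(make_base_anchors((size,), (0.5, 1.0, 2.0)))
# the first tensor printed matches the first three rows of the table above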
Next, the AnchorGenerator.grid_anchors function:
# AnchorGenerator
def grid_anchors(self, grid_sizes, strides):
    # type: (List[List[int]], List[List[Tensor]])
    anchors = []
    cell_anchors = self.cell_anchors
    assert cell_anchors is not None
    for size, stride, base_anchors in zip(
        grid_sizes, strides, cell_anchors
    ):
        grid_height, grid_width = size
        stride_height, stride_width = stride
        device = base_anchors.device
        # For output anchor, compute [x_center, y_center, x_center, y_center]
        shifts_x = torch.arange(
            0, grid_width, dtype=torch.float32, device=device
        ) * stride_width
        shifts_y = torch.arange(
            0, grid_height, dtype=torch.float32, device=device
        ) * stride_height
        shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
        shift_x = shift_x.reshape(-1)
        shift_y = shift_y.reshape(-1)
        shifts = torch.stack((shift_x, shift_y, shift_x, shift_y), dim=1)
        # For every (base anchor, output anchor) pair,
        # offset each zero-centered base anchor by the center of the output anchor.
        anchors.append(
            (shifts.view(-1, 1, 4) + base_anchors.view(1, -1, 4)).reshape(-1, 4)
        )
    return anchors

def forward(self, image_list, feature_maps):
    # type: (ImageList, List[Tensor])
    grid_sizes = list([feature_map.shape[-2:] for feature_map in feature_maps])
    image_size = image_list.tensors.shape[-2:]
    dtype, device = feature_maps[0].dtype, feature_maps[0].device
    strides = [[torch.tensor(image_size[0] / g[0], dtype=torch.int64, device=device),
                torch.tensor(image_size[1] / g[1], dtype=torch.int64, device=device)] for g in grid_sizes]
    self.set_cell_anchors(dtype, device)
    anchors_over_all_feature_maps = self.cached_grid_anchors(grid_sizes, strides)
    ......
As mentioned earlier, because of FPN the rpn receives several feature maps. For ease of exposition, the following describes a single feature map; the others are handled the same way.

Suppose the feature map has size $h \times w$. First the downsampling factor (stride) of this feature map relative to the input image is computed:

$$stride_h = H_{image} / h, \qquad stride_w = W_{image} / w$$

Then an $h \times w$ grid is generated, with each cell of side length stride, as shown below:
# AnchorGenerator.grid_anchors(...)
shifts_x = torch.arange(0, grid_width, dtype=torch.float32, device=device) * stride_width
shifts_y = torch.arange(0, grid_height, dtype=torch.float32, device=device) * stride_height
shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)

Figure 8
The base_anchors are then shifted from their center at $(0, 0)$ to the grid points, placing one group of base_anchors at every grid point. The current feature_map is now covered with anchors.
A note on why anchors live only at grid points: the stride corresponds to the network's receptive-field granularity, and the network cannot detect boxes more densely than the feature_map itself. Conversely, placing anchors between two grid points would correspond to half a pixel on the feature_map, which clearly makes no sense.
# AnchorGenerator.grid_anchors(...)
anchors.append((shifts.view(-1, 1, 4) + base_anchors.view(1, -1, 4)).reshape(-1, 4))

Figure 9 (note: only 3 anchors per point are drawn for clarity; in reality each point has 9 anchors)
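The broadcast in that anchors.append(...) line is worth unpacking: each grid point contributes one shift vector, and each shift is added to every base anchor. A tiny sketch with a hypothetical 2x2 grid, stride 16, and a single base anchor:

import torch

stride = 16
shifts_x = torch.arange(0, 2, dtype=torch.float32) * stride   # grid_width = 2
shifts_y = torch.arange(0, 2, dtype=torch.float32) * stride   # grid_height = 2
shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
shift_x, shift_y = shift_x.reshape(-1), shift_y.reshape(-1)
shifts = torch.stack((shift_x, shift_y, shift_x, shift_y), dim=1)  # (4, 4)

base_anchors = torch.tensor([[-16., -16., 16., 16.]])  # one anchor for brevity
# (4, 1, 4) + (1, 1, 4): every shift is applied to every base anchor
anchors = (shifts.view(-1, 1, 4) + base_anchors.view(1, -1, 4)).reshape(-1, 4)
print(anchors)
# tensor([[-16., -16., 16., 16.],
#         [  0., -16., 32., 16.],
#         [-16.,   0., 16., 32.],
#         [  0.,   0., 32., 32.]])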
Once the anchors are in place, the network must be adapted so that its outputs can judge whether each anchor contains an object, and also produce the 4 values $(t_x, t_y, t_w, t_h)$ needed for bounding box regression.
class RPNHead(nn.Module):
    def __init__(self, in_channels, num_anchors):
        super(RPNHead, self).__init__()
        self.conv = nn.Conv2d(
            in_channels, in_channels, kernel_size=3, stride=1, padding=1
        )
        self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1, stride=1)
        self.bbox_pred = nn.Conv2d(
            in_channels, num_anchors * 4, kernel_size=1, stride=1
        )

    def forward(self, x):
        logits = []
        bbox_reg = []
        for feature in x:
            t = F.relu(self.conv(feature))
            logits.append(self.cls_logits(t))
            bbox_reg.append(self.bbox_pred(t))
        return logits, bbox_reg
Suppose the feature map has size $C \times h \times w$ with $k$ anchors at each location. The RPNHead code shows clearly that:
- a 3x3 convolution is applied to the feature first;
- a 1x1 convolution then produces cls_logits of size $k \times h \times w$, one score per anchor for whether it contains an object;
- in parallel, a 1x1 convolution produces bbox_pred of size $4k \times h \times w$, the 4 box-regression values for each anchor.
# RPNHead.__init__(...)
self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1, stride=1)
self.bbox_pred = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1, stride=1)

Figure 10 (note: only 3 anchors per point are drawn for clarity; in reality each point has 9 anchors)
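A quick shape check confirms these output sizes (assuming RPNHead can be imported from torchvision.models.detection.rpn, and using k = 3 anchors per location as in the FPN configuration):

import torch
from torchvision.models.detection.rpn import RPNHead

head = RPNHead(in_channels=256, num_anchors=3)
feature = torch.rand(1, 256, 25, 38)   # one FPN level, h=25, w=38
logits, bbox_reg = head([feature])     # forward() takes a list of feature maps
print(logits[0].shape)                 # torch.Size([1, 3, 25, 38])  -> k x h x w
print(bbox_reg[0].shape)               # torch.Size([1, 12, 25, 38]) -> 4k x h x w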
The process above covers a single feature_map. For the several differently sized feature_maps output by the FPN, each one goes through the same procedure: compute its stride and grid, then place anchors. Once all of them are processed, the image is densely covered with anchors of every size.
Next comes the RegionProposalNetwork class:
# FasterRCNN.__init__(...)
rpn_pre_nms_top_n = dict(training=rpn_pre_nms_top_n_train, testing=rpn_pre_nms_top_n_test)
rpn_post_nms_top_n = dict(training=rpn_post_nms_top_n_train, testing=rpn_post_nms_top_n_test)
# rpn_anchor_generator generates the anchors
# rpn_head maps feature_maps to cls_logits + bbox_pred
rpn = RegionProposalNetwork(
    rpn_anchor_generator, rpn_head,
    rpn_fg_iou_thresh, rpn_bg_iou_thresh,
    rpn_batch_size_per_image, rpn_positive_fraction,
    rpn_pre_nms_top_n, rpn_post_nms_top_n, rpn_nms_thresh)

The job of the RegionProposalNetwork class is:
- at test time: find the anchors that contain objects, apply box regression to generate proposals, then run NMS
- at training time: all of the above, plus computing the rpn losses
class RegionProposalNetwork(torch.nn.Module):
    .......
    def forward(self, images, features, targets=None):
        features = list(features.values())
        objectness, pred_bbox_deltas = self.head(features)
        anchors = self.anchor_generator(images, features)
        num_images = len(anchors)
        num_anchors_per_level_shape_tensors = [o[0].shape for o in objectness]
        num_anchors_per_level = [s[0] * s[1] * s[2] for s in num_anchors_per_level_shape_tensors]
        objectness, pred_bbox_deltas = \
            concat_box_prediction_layers(objectness, pred_bbox_deltas)
        # apply pred_bbox_deltas to anchors to obtain the decoded proposals
        # note that we detach the deltas because Faster R-CNN do not backprop through
        # the proposals
        proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
        proposals = proposals.view(num_images, -1, 4)
        boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level)
        losses = {}
        if self.training:
            assert targets is not None
            labels, matched_gt_boxes = self.assign_targets_to_anchors(anchors, targets)
            regression_targets = self.box_coder.encode(matched_gt_boxes, anchors)
            loss_objectness, loss_rpn_box_reg = self.compute_loss(
                objectness, pred_bbox_deltas, labels, regression_targets)
            losses = {
                "loss_objectness": loss_objectness,
                "loss_rpn_box_reg": loss_rpn_box_reg,
            }
        return boxes, losses
Concretely, it first predicts which anchors contain objects and applies box regression to generate proposals:
# RegionProposalNetwork.forward(...)
objectness, pred_bbox_deltas = self.head(features)
anchors = self.anchor_generator(images, features)
......
proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
proposals = proposals.view(num_images, -1, 4)
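box_coder.decode applies the classic Faster R-CNN parameterization: the predicted deltas $(t_x, t_y, t_w, t_h)$ shift an anchor's center and rescale its width and height. A minimal sketch of the decoding math (omitting the weights and the clamping of dw/dh that torchvision's BoxCoder also applies):

import torch

def decode_deltas(deltas, anchors):
    # anchors are (x1, y1, x2, y2); convert to center/size form
    widths = anchors[:, 2] - anchors[:, 0]
    heights = anchors[:, 3] - anchors[:, 1]
    ctr_x = anchors[:, 0] + 0.5 * widths
    ctr_y = anchors[:, 1] + 0.5 * heights
    dx, dy, dw, dh = deltas.unbind(dim=1)
    # shift the center, rescale the size
    pred_ctr_x = dx * widths + ctr_x
    pred_ctr_y = dy * heights + ctr_y
    pred_w = torch.exp(dw) * widths
    pred_h = torch.exp(dh) * heights
    # back to corner form
    return torch.stack([pred_ctr_x - 0.5 * pred_w, pred_ctr_y - 0.5 * pred_h,
                        pred_ctr_x + 0.5 * pred_w, pred_ctr_y + 0.5 * pred_h], dim=1)

anchor = torch.tensor([[0., 0., 32., 32.]])
print(decode_deltas(torch.tensor([[0.25, 0., 0., 0.]]), anchor))
# tensor([[ 8.,  0., 40., 32.]]) -> the center moved 8px to the right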
The proposals are then sorted by objectness score in descending order (favoring the ones most likely to contain objects) and NMS is applied, yielding boxes (the proposal boxes that survive NMS):
# RegionProposalNetwork.forward(...)
boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level)
At training time, the anchors must additionally be matched to the ground-truth boxes in order to compute the cls_logits loss loss_objectness and the bbox_pred loss loss_rpn_box_reg.
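compute_loss then samples a balanced minibatch of anchors (rpn_batch_size_per_image=256 with rpn_positive_fraction=0.5) and applies binary cross entropy to the objectness logits and a smooth L1 loss to the box deltas of the positive anchors. Schematically (a sketch, not the exact torchvision code):

import torch
import torch.nn.functional as F

def rpn_loss_sketch(objectness, labels, pred_deltas, target_deltas):
    # objectness/labels: the sampled anchors, label 1 = foreground, 0 = background
    loss_objectness = F.binary_cross_entropy_with_logits(objectness, labels)
    # box loss only over the foreground anchors, normalized by the sample size
    loss_rpn_box_reg = F.smooth_l1_loss(pred_deltas, target_deltas,
                                        reduction='sum') / labels.numel()
    return loss_objectness, loss_rpn_box_reg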
After RegionProposalNetwork we have the boxes; the next step is to extract the features inside each box via roi_pooling:
roi_heads = RoIHeads(
    # Box
    box_roi_pool, box_head, box_predictor,
    box_fg_iou_thresh, box_bg_iou_thresh,
    box_batch_size_per_image, box_positive_fraction,
    bbox_reg_weights,
    box_score_thresh, box_nms_thresh, box_detections_per_img)
One question arises here: which feature_map should each box take its features from?
- the original FasterRCNN extracts box features only from the backbone's final feature_map;
- with FPN the backbone outputs multiple feature maps, so we must work out which one each box corresponds to.

As shown below:

Figure 11
class MultiScaleRoIAlign(nn.Module):
    ......
    def infer_scale(self, feature, original_size):
        # type: (Tensor, List[int])
        # assumption: the scale is of the form 2 ** (-k), with k integer
        size = feature.shape[-2:]
        possible_scales = torch.jit.annotate(List[float], [])
        for s1, s2 in zip(size, original_size):
            approx_scale = float(s1) / float(s2)
            scale = 2 ** float(torch.tensor(approx_scale).log2().round())
            possible_scales.append(scale)
        assert possible_scales[0] == possible_scales[1]
        return possible_scales[0]

    def setup_scales(self, features, image_shapes):
        # type: (List[Tensor], List[Tuple[int, int]])
        assert len(image_shapes) != 0
        max_x = 0
        max_y = 0
        for shape in image_shapes:
            max_x = max(shape[0], max_x)
            max_y = max(shape[1], max_y)
        original_input_shape = (max_x, max_y)
        scales = [self.infer_scale(feat, original_input_shape) for feat in features]
        # get the levels in the feature map by leveraging the fact that the network always
        # downsamples by a factor of 2 at each level.
        lvl_min = -torch.log2(torch.tensor(scales[0], dtype=torch.float32)).item()
        lvl_max = -torch.log2(torch.tensor(scales[-1], dtype=torch.float32)).item()
        self.scales = scales
        self.map_levels = initLevelMapper(int(lvl_min), int(lvl_max))
First the downsampling scale of each feature_map relative to the network input image is computed. The infer_scale function uses the approximation

$$scale = 2^{\operatorname{round}\left(\log_2\left(feature\_size \,/\, image\_size\right)\right)}$$

In effect this is a simple mapping that snaps each feature_map-to-image size ratio onto the nearest power-of-two scale:

Figure 12

For FasterRCNN with ResNet50+FPN, the actual values come out to scales = (1/4, 1/8, 1/16, 1/32), after which lvl_min=2 and lvl_max=5 are set:
# MultiScaleRoIAlign.setup_scales(...)
# get the levels in the feature map by leveraging the fact that the network always
# downsamples by a factor of 2 at each level.
lvl_min = -torch.log2(torch.tensor(scales[0], dtype=torch.float32)).item()
lvl_max = -torch.log2(torch.tensor(scales[-1], dtype=torch.float32)).item()
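Plugging in the ResNet50+FPN scales above gives a quick sanity check:

import torch

scales = [1 / 4, 1 / 8, 1 / 16, 1 / 32]
lvl_min = -torch.log2(torch.tensor(scales[0], dtype=torch.float32)).item()
lvl_max = -torch.log2(torch.tensor(scales[-1], dtype=torch.float32)).item()
print(lvl_min, lvl_max)  # 2.0 5.0 -> FPN levels 2 (stride 4) through 5 (stride 32)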
Next, the formula from the FPN paper (its Eqn. 1) determines which feature level $k$ a box belongs to, where $wh$ is the area of the box:

$$k = \left\lfloor k_0 + \log_2\left(\sqrt{wh}\,/\,224\right) \right\rfloor, \qquad k_0 = 4$$
class LevelMapper(object):
    def __init__(self, k_min, k_max, canonical_scale=224, canonical_level=4, eps=1e-6):
        self.k_min = k_min  # lvl_min=2
        self.k_max = k_max  # lvl_max=5
        self.s0 = canonical_scale  # 224
        self.lvl0 = canonical_level  # 4
        self.eps = eps

    def __call__(self, boxlists):
        s = torch.sqrt(torch.cat([box_area(boxlist) for boxlist in boxlists]))
        # Eqn.(1) in FPN paper
        target_lvls = torch.floor(self.lvl0 + torch.log2(s / self.s0) + torch.tensor(self.eps, dtype=s.dtype))
        target_lvls = torch.clamp(target_lvls, min=self.k_min, max=self.k_max)
        return (target_lvls.to(torch.int64) - self.k_min).to(torch.int64)
Here torch.clamp(input, min, max) → Tensor truncates values so they cannot go out of range:

$$y_i = \begin{cases} \text{min} & x_i < \text{min} \\ x_i & \text{min} \le x_i \le \text{max} \\ \text{max} & x_i > \text{max} \end{cases}$$
As we can see, the LevelMapper class assigns boxes of different sizes to particular feature_maps, as in the figure below. After that, roi_pooling proceeds following the flow in Figure 11.

Figure 13
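A worked example of the mapping: a 112x112 box has $\sqrt{wh} = 112$, so $k = \lfloor 4 + \log_2(112/224) \rfloor = 3$, while a 448x448 box lands on $k = 5$. The same arithmetic in code (a standalone sketch of LevelMapper's formula; map_level is my own helper name):

import torch

def map_level(w, h, k_min=2, k_max=5, s0=224.0, lvl0=4.0, eps=1e-6):
    s = torch.sqrt(torch.tensor(float(w * h)))
    k = torch.floor(lvl0 + torch.log2(s / s0) + eps)
    return int(torch.clamp(k, min=k_min, max=k_max))

print(map_level(112, 112))  # 3 -> the stride-8 feature map
print(map_level(448, 448))  # 5 -> the stride-32 feature map
print(map_level(32, 32))    # 2 (clamped; the raw formula would give k=1)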
Having determined which FPN feature_map each proposal box belongs to, let's see how MultiScaleRoIAlign performs the roi_pooling:
class MultiScaleRoIAlign(nn.Module):
    ......
    def forward(self, x, boxes, image_shapes):
        # type: (Dict[str, Tensor], List[Tensor], List[Tuple[int, int]]) -> Tensor
        x_filtered = []
        for k, v in x.items():
            if k in self.featmap_names:
                x_filtered.append(v)
        num_levels = len(x_filtered)
        rois = self.convert_to_roi_format(boxes)
        if self.scales is None:
            self.setup_scales(x_filtered, image_shapes)
        scales = self.scales
        assert scales is not None
        # without FPN, only the final 1/32 feature_map goes through roi_pooling
        if num_levels == 1:
            return roi_align(
                x_filtered[0], rois,
                output_size=self.output_size,
                spatial_scale=scales[0],
                sampling_ratio=self.sampling_ratio
            )
        # with FPN, 4 feature_maps take part in roi_pooling;
        # first map every box to its feature level
        mapper = self.map_levels
        assert mapper is not None
        levels = mapper(boxes)
        num_rois = len(rois)
        num_channels = x_filtered[0].shape[1]
        dtype, device = x_filtered[0].dtype, x_filtered[0].device
        result = torch.zeros(
            (num_rois, num_channels,) + self.output_size,
            dtype=dtype,
            device=device,
        )
        tracing_results = []
        for level, (per_level_feature, scale) in enumerate(zip(x_filtered, scales)):
            idx_in_level = torch.nonzero(levels == level).squeeze(1)
            rois_per_level = rois[idx_in_level]
            result_idx_in_level = roi_align(
                per_level_feature, rois_per_level,
                output_size=self.output_size,
                spatial_scale=scale, sampling_ratio=self.sampling_ratio)
            if torchvision._is_tracing():
                tracing_results.append(result_idx_in_level.to(dtype))
            else:
                result[idx_in_level] = result_idx_in_level
        if torchvision._is_tracing():
            result = _onnx_merge_levels(levels, tracing_results)
        return result
In MultiScaleRoIAlign.forward(...) we can see:
Without FPN, only the final 1/32 feature_map goes through roi_pooling:
if num_levels == 1:
    return roi_align(
        x_filtered[0], rois,
        output_size=self.output_size,
        spatial_scale=scales[0],
        sampling_ratio=self.sampling_ratio
    )
With FPN, four feature maps (at scales 1/4, 1/8, 1/16, 1/32) take part. First compute which feature map each box belongs to, then run roi_pooling on that feature map:
# first compute which feature map each box belongs to
levels = mapper(boxes)
......
# then run roi_pooling on the owning feature map, selecting its boxes via
# idx_in_level = torch.nonzero(levels == level).squeeze(1)
for level, (per_level_feature, scale) in enumerate(zip(x_filtered, scales)):
    idx_in_level = torch.nonzero(levels == level).squeeze(1)
    rois_per_level = rois[idx_in_level]
    result_idx_in_level = roi_align(
        per_level_feature, rois_per_level,
        output_size=self.output_size,
        spatial_scale=scale, sampling_ratio=self.sampling_ratio)
This produces the well-known 7x7 features (output_size=7 is set in FasterRCNN.__init__(...)). Note that the original FasterRCNN used roi_pooling; here roi_align is used instead, which improves detector accuracy.
The parameters passed to torchvision.ops.roi_align are:
- per_level_feature: one of the feature_maps output by the FPN
- rois_per_level: all proposal boxes assigned to that feature_map (from the level computation above)
- output_size=7: the output is 7x7
- spatial_scale: the downsampling scale of the feature_map relative to the input image (e.g. 1/4, 1/8, ...)
- sampling_ratio: the roi_align sampling rate; interested readers can consult the MaskRCNN paper

A standalone call is sketched below.
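A minimal call with hypothetical numbers (the box coordinates and feature size here are made up for illustration):

import torch
from torchvision.ops import roi_align

feature = torch.rand(1, 256, 50, 50)  # e.g. the stride-16 level of an 800x800 input
# rois are (batch_index, x1, y1, x2, y2), in input-image coordinates
rois = torch.tensor([[0., 100., 100., 300., 300.]])
out = roi_align(feature, rois, output_size=7, spatial_scale=1 / 16, sampling_ratio=2)
print(out.shape)  # torch.Size([1, 256, 7, 7])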
The final step converts these features into the box's class scores (person, cat, dog, car, ...) and a further round of box regression.
class TwoMLPHead(nn.Module):
    def __init__(self, in_channels, representation_size):
        super(TwoMLPHead, self).__init__()
        self.fc6 = nn.Linear(in_channels, representation_size)
        self.fc7 = nn.Linear(representation_size, representation_size)

    def forward(self, x):
        x = x.flatten(start_dim=1)
        x = F.relu(self.fc6(x))
        x = F.relu(self.fc7(x))
        return x

class FastRCNNPredictor(nn.Module):
    def __init__(self, in_channels, num_classes):
        super(FastRCNNPredictor, self).__init__()
        self.cls_score = nn.Linear(in_channels, num_classes)
        self.bbox_pred = nn.Linear(in_channels, num_classes * 4)

    def forward(self, x):
        if x.dim() == 4:
            assert list(x.shape[2:]) == [1, 1]
        x = x.flatten(start_dim=1)
        scores = self.cls_score(x)
        bbox_deltas = self.bbox_pred(x)
        return scores, bbox_deltas
TwoMLPHead first maps the 7x7 features through two fully connected layers down to 1024 dimensions; FastRCNNPredictor then turns each box's 1024-dim feature into cls_score and bbox_pred:

Figure 14
A softmax over cls_score gives the class probabilities, which determine the box's class; once the class is fixed, the 4 values in bbox_pred at that class's position are the offsets for the second bounding box regression.
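Schematically, the per-box decision looks like the sketch below (torchvision's postprocess_detections additionally applies score thresholding, NMS, and box decoding):

import torch
import torch.nn.functional as F

num_classes = 91
scores = torch.rand(1, num_classes)            # cls_score for one box
bbox_deltas = torch.rand(1, num_classes * 4)   # bbox_pred for one box

probs = F.softmax(scores, dim=-1)              # class probabilities
cls = int(probs.argmax(dim=-1))                # predicted class id
deltas = bbox_deltas[0, cls * 4: cls * 4 + 4]  # the 4 regression values for that class
print(cls, deltas.shape)                       # e.g. 42 torch.Size([4])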
In short, the structure of FasterRCNN with FPN can be summarized by the figure below:

Figure 15
On Training
The FasterRCNN model computes losses in two places:
- in the RegionProposalNetwork class, which must decide whether anchors contain objects in order to generate proposals; this requires a loss
- in the RoIHeads class, where the cls_score and bbox_pred produced by the fully connected layers after roi_pooling are trained; this also requires a loss
First, the assign_targets_to_anchors function in the RegionProposalNetwork class.
def assign_targets_to_anchors(self, anchors, targets):
    # type: (List[Tensor], List[Dict[str, Tensor]])
    labels = []
    matched_gt_boxes = []
    for anchors_per_image, targets_per_image in zip(anchors, targets):
        gt_boxes = targets_per_image["boxes"]
        if gt_boxes.numel() == 0:
            # Background image (negative example)
            device = anchors_per_image.device
            matched_gt_boxes_per_image = torch.zeros(anchors_per_image.shape, dtype=torch.float32, device=device)
            labels_per_image = torch.zeros((anchors_per_image.shape[0],), dtype=torch.float32, device=device)
        else:
            match_quality_matrix = box_ops.box_iou(gt_boxes, anchors_per_image)
            matched_idxs = self.proposal_matcher(match_quality_matrix)
            # get the targets corresponding GT for each proposal
            # NB: need to clamp the indices because we can have a single
            # GT in the image, and matched_idxs can be -2, which goes
            # out of bounds
            matched_gt_boxes_per_image = gt_boxes[matched_idxs.clamp(min=0)]
            labels_per_image = matched_idxs >= 0
            labels_per_image = labels_per_image.to(dtype=torch.float32)
            # Background (negative examples)
            bg_indices = matched_idxs == self.proposal_matcher.BELOW_LOW_THRESHOLD
            labels_per_image[bg_indices] = torch.tensor(0.0)
            # discard indices that are between thresholds
            inds_to_discard = matched_idxs == self.proposal_matcher.BETWEEN_THRESHOLDS
            labels_per_image[inds_to_discard] = torch.tensor(-1.0)
        labels.append(labels_per_image)
        matched_gt_boxes.append(matched_gt_boxes_per_image)
    return labels, matched_gt_boxes
When an image contains no gt_boxes, every anchor is labeled as background (label 0):
if gt_boxes.numel() == 0:
    # Background image (negative example)
    device = anchors_per_image.device
    matched_gt_boxes_per_image = torch.zeros(anchors_per_image.shape, dtype=torch.float32, device=device)
    labels_per_image = torch.zeros((anchors_per_image.shape[0],), dtype=torch.float32, device=device)
When an image does contain gt_boxes, the IOU between each anchor and the gt_boxes is computed:

Anchors with IOU < 0.3 are background, label 0:
labels_per_image[bg_indices] = torch.tensor(0.0)

Anchors with IOU > 0.7 are foreground, label 1:
labels_per_image = matched_idxs >= 0

Anchors with IOU between 0.3 and 0.7 are ignored and do not take part in training.
This is directly visible in the default arguments of FasterRCNN's __init__ function:
rpn_fg_iou_thresh=0.7, rpn_bg_iou_thresh=0.3,
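The rule can be condensed to a few lines (a sketch of the thresholding only; the real Matcher additionally keeps "low-quality" matches so that every gt_box gets at least one anchor):

import torch
from torchvision.ops import box_iou

def label_anchors(gt_boxes, anchors, fg_thresh=0.7, bg_thresh=0.3):
    # boxes are (x1, y1, x2, y2); iou has shape (num_gt, num_anchors)
    iou = box_iou(gt_boxes, anchors)
    best_iou, matched_idxs = iou.max(dim=0)         # best gt for every anchor
    labels = torch.full((anchors.shape[0],), -1.0)  # -1 = ignored in training
    labels[best_iou >= fg_thresh] = 1.0             # foreground
    labels[best_iou < bg_thresh] = 0.0              # background
    return labels, matched_idxs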
Now look at the assign_targets_to_proposals function in the RoIHeads class.
def assign_targets_to_proposals(self, proposals, gt_boxes, gt_labels):
    # type: (List[Tensor], List[Tensor], List[Tensor])
    matched_idxs = []
    labels = []
    for proposals_in_image, gt_boxes_in_image, gt_labels_in_image in zip(proposals, gt_boxes, gt_labels):
        if gt_boxes_in_image.numel() == 0:
            # Background image
            device = proposals_in_image.device
            clamped_matched_idxs_in_image = torch.zeros(
                (proposals_in_image.shape[0],), dtype=torch.int64, device=device
            )
            labels_in_image = torch.zeros(
                (proposals_in_image.shape[0],), dtype=torch.int64, device=device
            )
        else:
            # set to self.box_similarity when https://github.com/pytorch/pytorch/issues/27495 lands
            match_quality_matrix = box_ops.box_iou(gt_boxes_in_image, proposals_in_image)
            matched_idxs_in_image = self.proposal_matcher(match_quality_matrix)
            clamped_matched_idxs_in_image = matched_idxs_in_image.clamp(min=0)
            labels_in_image = gt_labels_in_image[clamped_matched_idxs_in_image]
            labels_in_image = labels_in_image.to(dtype=torch.int64)
            # Label background (below the low threshold)
            bg_inds = matched_idxs_in_image == self.proposal_matcher.BELOW_LOW_THRESHOLD
            labels_in_image[bg_inds] = torch.tensor(0)
            # Label ignore proposals (between low and high thresholds)
            ignore_inds = matched_idxs_in_image == self.proposal_matcher.BETWEEN_THRESHOLDS
            labels_in_image[ignore_inds] = torch.tensor(-1)  # -1 is ignored by sampler
        matched_idxs.append(clamped_matched_idxs_in_image)
        labels.append(labels_in_image)
    return matched_idxs, labels
Unlike assign_targets_to_anchors, this function uses:
box_fg_iou_thresh=0.5, box_bg_iou_thresh=0.5,
Proposals with IOU > 0.5 are foreground, labeled with the class_id of their matched gt_box:
labels_in_image = gt_labels_in_image[clamped_matched_idxs_in_image]
Note the difference from the RPN stage: RegionProposalNetwork only decides whether an anchor contains an object, so its positive label is 1; RoIHeads must decide each proposal's specific class, so the positive label is the actual class_id.
Proposals with IOU < 0.5 are background, label 0:
labels_in_image[bg_inds] = torch.tensor(0)
Final Thoughts
This article briefly walked through the FasterRCNN implementation in torchvision and highlighted the points I consider important. It is meant as a guide for readers who find the code hard to approach, and as encouragement for newcomers to read more implementations. To truly understand a model (and not be stumped by an interviewer), you still have to read the code!