
A Detailed Walkthrough of the torchvision Implementation of SSD

          2021-12-09 20:01


An earlier post, 目標檢測算法之SSD, already covered the principles of the SSD detector and an implementation, but it only included inference code. This updated post walks through every part of SSD from the code point of view, based on the torchvision version, including data augmentation and training.

Backbone Feature Extractor

SSD uses VGG16 as its backbone; the overall architecture of SSD300 is shown in the figure from the paper. SSD extracts multi-scale features for detection, so the VGG16 model has to be modified and extended with a few extra modules. The VGG16 body contains 5 maxpool layers, each of which halves the feature map, so it can be viewed as 5 stages of 3x3 conv layers; for example, the last stage contains 3 conv layers, denoted Conv5_1, Conv5_2, Conv5_3 (5 is the stage index, the trailing number is the conv layer index within the stage). Conv4_3 is the output of the 3rd conv layer of the 4th stage (the layer right before the 4th maxpool); its feature map is 38x38 (300/2^3, rounded up by ceil_mode), and it is the first feature map used for detection. Because this feature is relatively early and its norm tends to be large, an extra L2 Normalization layer is appended to it.

Compared with the original VGG16, the 5th maxpool layer is changed from 2x2-s2 to 3x3-s1, so the feature map after it stays at 19x19 (no downsampling). The fully-connected layers fc6 and fc7 of VGG16 are then converted into two conv layers: a 3x3 Conv6 and a 1x1 Conv7, where Conv6 uses dilated (atrous) convolution with dilation=6. Conv7 is the 2nd feature map used for detection, with size 19x19.

On top of that, SSD appends 4 extra modules to extract more features. Each module consists of two conv layers (1x1 conv + 3x3 conv), and the output of the 3x3 conv is used for detection; these outputs are denoted Conv8_2, Conv9_2, Conv10_2 and Conv11_2, with feature map sizes 10x10, 5x5, 3x3 and 1x1 respectively. For SSD512 the input image is larger, so one more module, Conv12_2, is added. The feature extractor is implemented as follows:

from collections import OrderedDict
from typing import Dict, List, Optional, Tuple

import torch
import torch.nn.functional as F
from torch import nn, Tensor


class SSDFeatureExtractorVGG(nn.Module):
    def __init__(self, backbone: nn.Module, highres: bool):
        super().__init__()

        # Find the positions of maxpool3 and maxpool4 (backbone is the vgg16 feature part)
        _, _, maxpool3_pos, maxpool4_pos, _ = (i for i, layer in enumerate(backbone) if isinstance(layer, nn.MaxPool2d))

        # Enable ceil_mode for maxpool3 so the feature map becomes 38x38 instead of 37x37
        backbone[maxpool3_pos].ceil_mode = True

        # L2 normalization + rescaling for Conv4_3
        self.scale_weight = nn.Parameter(torch.ones(512) * 20)

        # Modules up to Conv4_3, used to extract the first feature map
        self.features = nn.Sequential(
            *backbone[:maxpool4_pos]
        )

        # The 4 extra modules
        extra = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(1024, 256, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, 512, kernel_size=3, padding=1, stride=2),  # conv8_2
                nn.ReLU(inplace=True),
            ),
            nn.Sequential(
                nn.Conv2d(512, 128, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, 256, kernel_size=3, padding=1, stride=2),  # conv9_2
                nn.ReLU(inplace=True),
            ),
            nn.Sequential(
                nn.Conv2d(256, 128, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, 256, kernel_size=3),  # conv10_2
                nn.ReLU(inplace=True),
            ),
            nn.Sequential(
                nn.Conv2d(256, 128, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, 256, kernel_size=3),  # conv11_2
                nn.ReLU(inplace=True),
            )
        ])
        if highres:
            # SSD512 has one more extra module
            extra.append(nn.Sequential(
                nn.Conv2d(256, 128, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, 256, kernel_size=4),  # conv12_2
                nn.ReLU(inplace=True),
            ))
        _xavier_init(extra)

        # maxpool5 + Conv6 (fc6) + Conv7 (fc7); randomly initialized here, no weight conversion
        fc = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1, ceil_mode=False),  # add modified maxpool5
            nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, padding=6, dilation=6),  # FC6 with atrous
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=1),  # FC7
            nn.ReLU(inplace=True)
        )
        _xavier_init(fc)
        # Prepend the Conv5_x + fc module; its output (Conv7) is the 2nd feature map
        extra.insert(0, nn.Sequential(
            *backbone[maxpool4_pos:-1],  # until conv5_3, skip maxpool5
            fc,
        ))
        self.extra = extra

    def forward(self, x: Tensor) -> Dict[str, Tensor]:
        # Conv4_3
        x = self.features(x)
        rescaled = self.scale_weight.view(1, -1, 1, 1) * F.normalize(x)
        output = [rescaled]

        # Compute Conv5_3/Conv7, Conv8_2, Conv9_2, Conv10_2, Conv11_2, (Conv12_2)
        for block in self.extra:
            x = block(x)
            output.append(x)

        return OrderedDict([(str(i), v) for i, v in enumerate(output)])
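As a quick sanity check, here is a minimal sketch (assuming the class above, plus the _xavier_init helper it references, which in torchvision's ssd.py simply Xavier-initializes the conv layers) that builds the extractor from torchvision's vgg16 and prints the six feature-map resolutions:

import torch
import torchvision
from torch import nn

def _xavier_init(conv: nn.Module):
    # Xavier-initialize every conv layer (as done in torchvision's ssd.py)
    for layer in conv.modules():
        if isinstance(layer, nn.Conv2d):
            torch.nn.init.xavier_uniform_(layer.weight)
            if layer.bias is not None:
                torch.nn.init.constant_(layer.bias, 0.0)

backbone = torchvision.models.vgg16(pretrained=False).features
extractor = SSDFeatureExtractorVGG(backbone, highres=False)
feats = extractor(torch.randn(1, 3, 300, 300))
print([tuple(f.shape[-2:]) for f in feats.values()])
# [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)]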

Multi-scale detection is one of SSD's key features: features at different scales can handle objects of different sizes. Since FPN was proposed, however, most detectors extract multi-scale features with an FPN-like structure, which, unlike SSD, also fuses features across scales.

Detection Head

SSD's detection head is simple: a single 3x3 conv is attached to each feature map. Its output has channels = A * (C + 4), where A is the number of default boxes per location and C is the number of classes to detect plus one for the background class (C = num_classes + 1). Besides the class scores, the head also predicts the box location as 4 offsets relative to the default box. The implementation is shown below (classification and regression are split into two heads here, which is equivalent to the paper):

# Base class: reshapes the 4D conv outputs into the final prediction format (N, H*W*A, K)
class SSDScoringHead(nn.Module):
    def __init__(self, module_list: nn.ModuleList, num_columns: int):
        super().__init__()
        self.module_list = module_list
        self.num_columns = num_columns

    def _get_result_from_module_list(self, x: Tensor, idx: int) -> Tensor:
        """
        This is equivalent to self.module_list[idx](x),
        but torchscript doesn't support this yet
        """
        num_blocks = len(self.module_list)
        if idx < 0:
            idx += num_blocks
        out = x
        for i, module in enumerate(self.module_list):
            if i == idx:
                out = module(x)
        return out

    def forward(self, x: List[Tensor]) -> Tensor:
        all_results = []

        for i, features in enumerate(x):
            results = self._get_result_from_module_list(features, i)

            # Permute output from (N, A * K, H, W) to (N, HWA, K).
            N, _, H, W = results.shape
            results = results.view(N, -1, self.num_columns, H, W)
            results = results.permute(0, 3, 4, 1, 2)
            results = results.reshape(N, -1, self.num_columns)  # Size=(N, HWA, K)

            all_results.append(results)

        return torch.cat(all_results, dim=1)

# Classification head
class SSDClassificationHead(SSDScoringHead):
    def __init__(self, in_channels: List[int], num_anchors: List[int], num_classes: int):
        cls_logits = nn.ModuleList()
        for channels, anchors in zip(in_channels, num_anchors):
            cls_logits.append(nn.Conv2d(channels, num_classes * anchors, kernel_size=3, padding=1))
        _xavier_init(cls_logits)
        super().__init__(cls_logits, num_classes)

# Box regression head
class SSDRegressionHead(SSDScoringHead):
    def __init__(self, in_channels: List[int], num_anchors: List[int]):
        bbox_reg = nn.ModuleList()
        for channels, anchors in zip(in_channels, num_anchors):
            bbox_reg.append(nn.Conv2d(channels, 4 * anchors, kernel_size=3, padding=1))
        _xavier_init(bbox_reg)
        super().__init__(bbox_reg, 4)

class SSDHead(nn.Module):
    def __init__(self, in_channels: List[int], num_anchors: List[int], num_classes: int):
        super().__init__()
        self.classification_head = SSDClassificationHead(in_channels, num_anchors, num_classes)
        self.regression_head = SSDRegressionHead(in_channels, num_anchors)

    def forward(self, x: List[Tensor]) -> Dict[str, Tensor]:
        return {
            "bbox_regression": self.regression_head(x),
            "cls_logits": self.classification_head(x),
        }
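To make the tensor shapes concrete, here is a small sketch (assuming the head classes above and the SSD300 configuration: input channels [512, 1024, 512, 256, 256, 256], anchors per location [4, 6, 6, 6, 4, 4], and 91 classes for COCO including background) that checks the flattened output sizes:

# Dummy feature maps with the SSD300 channels and resolutions
channels = [512, 1024, 512, 256, 256, 256]
sizes = [38, 19, 10, 5, 3, 1]
feats = [torch.randn(1, c, s, s) for c, s in zip(channels, sizes)]

head = SSDHead(channels, [4, 6, 6, 6, 4, 4], num_classes=91)
out = head(feats)
print(out["cls_logits"].shape)       # torch.Size([1, 8732, 91])
print(out["bbox_regression"].shape)  # torch.Size([1, 8732, 4])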

Note that the heads are not shared across feature maps, since the feature scales differ. By comparison, RetinaNet's head uses 4 intermediate conv layers plus 1 prediction conv layer, is shared across all feature maps, and uses separate heads for classification and regression. A heavier head certainly helps detection quality, but it also increases the computation cost.

Default Boxes

SSD is an anchor-based one-stage detector; the anchors are called default boxes in the paper. As noted above, SSD300 extracts 6 feature maps of sizes 38x38, 19x19, 10x10, 5x5, 3x3 and 1x1. All locations of the same feature map use the same set of anchors, but different feature maps use different anchors: the smaller the feature map, the larger the anchor scale (the paper uses a linear rule to compute the anchor scale of each feature map). In SSD an anchor is centered at the center of a feature-map cell, and its shape is controlled by two parameters: scale and aspect_ratio. Each feature map uses anchors of the same scale but different aspect ratios. Denoting the anchor scale on the k-th feature map as s_k and the aspect ratio as a_r, the anchor width and height are w = s_k * sqrt(a_r) and h = s_k / sqrt(a_r). Concretely, the scales of the 6 feature maps are 0.07, 0.15, 0.33, 0.51, 0.69, 0.87 (plus 1.05, used only to compute s'_k for the last level); these scales are relative to the image size, not absolute sizes. Each feature map contains two special anchors: one with aspect_ratio=1 and scale s_k, and one with aspect_ratio=1 and scale s'_k = sqrt(s_k * s_{k+1}). Besides these two, each feature map also contains anchors with the other aspect ratios [2, 1/2], [2, 1/2, 3, 1/3], [2, 1/2, 3, 1/3], [2, 1/2, 3, 1/3], [2, 1/2], [2, 1/2], all with scale s_k. The implementation is shown below:

import math

from torchvision.models.detection.image_list import ImageList


class DefaultBoxGenerator(nn.Module):
    """
    This module generates the default boxes of SSD for a set of feature maps and image sizes.
    Args:
        aspect_ratios (List[List[int]]): A list with all the aspect ratios used in each feature map.
        min_ratio (float): The minimum scale :math:`\text{s}_{\text{min}}` of the default boxes used in the estimation
            of the scales of each feature map. It is used only if the ``scales`` parameter is not provided.
        max_ratio (float): The maximum scale :math:`\text{s}_{\text{max}}` of the default boxes used in the estimation
            of the scales of each feature map. It is used only if the ``scales`` parameter is not provided.
        scales (List[float]], optional): The scales of the default boxes. If not provided it will be estimated using
            the ``min_ratio`` and ``max_ratio`` parameters.
        steps (List[int]], optional): It's a hyper-parameter that affects the tiling of default boxes. If not provided
            it will be estimated from the data.
        clip (bool): Whether the standardized values of default boxes should be clipped between 0 and 1. The clipping
            is applied while the boxes are encoded in format ``(cx, cy, w, h)``.
    """

    def __init__(self, aspect_ratios: List[List[int]], min_ratio: float = 0.15, max_ratio: float = 0.9,
                 scales: Optional[List[float]] = None, steps: Optional[List[int]] = None, clip: bool = True):
        super().__init__()
        if steps is not None:
            assert len(aspect_ratios) == len(steps)
        self.aspect_ratios = aspect_ratios
        self.steps = steps
        self.clip = clip
        num_outputs = len(aspect_ratios)

        # If no scales are given, estimate the anchor scale of each feature map with the linear rule
        if scales is None:
            if num_outputs > 1:
                range_ratio = max_ratio - min_ratio
                self.scales = [min_ratio + range_ratio * k / (num_outputs - 1.0) for k in range(num_outputs)]
                self.scales.append(1.0)
            else:
                self.scales = [min_ratio, max_ratio]
        else:
            self.scales = scales

        self._wh_pairs = self._generate_wh_pairs(num_outputs)

    def _generate_wh_pairs(self, num_outputs: int, dtype: torch.dtype = torch.float32,
                           device: torch.device = torch.device("cpu")) -> List[Tensor]:
        _wh_pairs: List[Tensor] = []
        for k in range(num_outputs):
            # Add the 2 default anchors
            s_k = self.scales[k]
            s_prime_k = math.sqrt(self.scales[k] * self.scales[k + 1])
            wh_pairs = [[s_k, s_k], [s_prime_k, s_prime_k]]

            # Each aspect ratio produces a pair of anchors
            for ar in self.aspect_ratios[k]:
                sq_ar = math.sqrt(ar)
                w = self.scales[k] * sq_ar
                h = self.scales[k] / sq_ar
                wh_pairs.extend([[w, h], [h, w]])

            _wh_pairs.append(torch.as_tensor(wh_pairs, dtype=dtype, device=device))
        return _wh_pairs

    def num_anchors_per_location(self):
        # Number of anchors per location: 2 + 2 * len(aspect_ratios).
        return [2 + 2 * len(r) for r in self.aspect_ratios]

    # Default Boxes calculation based on page 6 of SSD paper
    def _grid_default_boxes(self, grid_sizes: List[List[int]], image_size: List[int],
                            dtype: torch.dtype = torch.float32) -> Tensor:
        default_boxes = []
        for k, f_k in enumerate(grid_sizes):
            # Now add the default boxes for each width-height pair
            if self.steps is not None:  # step is the number of image pixels covered by one feature-map cell
                x_f_k, y_f_k = [img_shape / self.steps[k] for img_shape in image_size]
            else:
                y_f_k, x_f_k = f_k

            # Compute the anchor centers
            shifts_x = ((torch.arange(0, f_k[1]) + 0.5) / x_f_k).to(dtype=dtype)
            shifts_y = ((torch.arange(0, f_k[0]) + 0.5) / y_f_k).to(dtype=dtype)
            shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
            shift_x = shift_x.reshape(-1)
            shift_y = shift_y.reshape(-1)

            shifts = torch.stack((shift_x, shift_y) * len(self._wh_pairs[k]), dim=-1).reshape(-1, 2)
            # Clipping the default boxes while the boxes are encoded in format (cx, cy, w, h)
            _wh_pair = self._wh_pairs[k].clamp(min=0, max=1) if self.clip else self._wh_pairs[k]
            wh_pairs = _wh_pair.repeat((f_k[0] * f_k[1]), 1)

            default_box = torch.cat((shifts, wh_pairs), dim=1)

            default_boxes.append(default_box)

        return torch.cat(default_boxes, dim=0)

    def forward(self, image_list: ImageList, feature_maps: List[Tensor]) -> List[Tensor]:
        # Images in the same batch have the same size, so the anchors are the same; compute them once
        grid_sizes = [feature_map.shape[-2:] for feature_map in feature_maps]
        image_size = image_list.tensors.shape[-2:]
        dtype, device = feature_maps[0].dtype, feature_maps[0].device
        default_boxes = self._grid_default_boxes(grid_sizes, image_size, dtype=dtype)
        default_boxes = default_boxes.to(device)

        dboxes = []
        for _ in image_list.image_sizes:
            dboxes_in_image = default_boxes
            # (cx, cy, w, h) -> (x1, y1, x2, y2)
            dboxes_in_image = torch.cat([dboxes_in_image[:, :2] - 0.5 * dboxes_in_image[:, 2:],
                                         dboxes_in_image[:, :2] + 0.5 * dboxes_in_image[:, 2:]], -1)
            dboxes_in_image[:, 0::2] *= image_size[1]  # multiply by the image size to get absolute coordinates
            dboxes_in_image[:, 1::2] *= image_size[0]
            dboxes.append(dboxes_in_image)
        return dboxes

# Anchor settings for SSD300
anchor_generator = DefaultBoxGenerator(
        [[2], [2, 3], [2, 3], [2, 3], [2], [2]],
        scales=[0.07, 0.15, 0.33, 0.51, 0.69, 0.87, 1.05],
        steps=[8, 16, 32, 64, 100, 300],
    )
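As a quick sanity check on the generator above (a small sketch assuming the DefaultBoxGenerator just shown), the SSD300 settings produce 4, 6, 6, 6, 4, 4 anchors per location, which, tiled over the six feature maps, gives the 8,732 default boxes quoted in the paper:

print(anchor_generator.num_anchors_per_location())   # [4, 6, 6, 6, 4, 4]
grid_sizes = [38, 19, 10, 5, 3, 1]
total = sum(g * g * a for g, a in zip(grid_sizes, anchor_generator.num_anchors_per_location()))
print(total)  # 8732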

As mentioned before, SSD regresses 4 values per anchor. They are the offsets of the box relative to the anchor, which involves a transformation between box and anchor usually called the box encoding. SSD uses the same encoding as Faster RCNN. Concretely, with the anchor center and size denoted (x_a, y_a, w_a, h_a) and the ground-truth box denoted (x, y, w, h), the 4 offsets are t_x = (x - x_a) / w_a, t_y = (y - y_a) / h_a, t_w = log(w / w_a), t_h = log(h / h_a). These 4 values are the regression targets; at inference time the predicted box is recovered by the inverse transform. Going from box to offsets is usually called encoding, and from offsets back to box decoding. The implementation is shown below:

def encode_boxes(reference_boxes: Tensor, proposals: Tensor, weights: Tensor) -> Tensor:
    """
    Encode a set of proposals with respect to some
    reference boxes
    Args:
        reference_boxes (Tensor): reference boxes
        proposals (Tensor): boxes to be encoded
        weights (Tensor[4]): the weights for ``(x, y, w, h)``
    """

    # perform some unpacking to make it JIT-fusion friendly
    wx = weights[0]
    wy = weights[1]
    ww = weights[2]
    wh = weights[3]

    proposals_x1 = proposals[:, 0].unsqueeze(1)
    proposals_y1 = proposals[:, 1].unsqueeze(1)
    proposals_x2 = proposals[:, 2].unsqueeze(1)
    proposals_y2 = proposals[:, 3].unsqueeze(1)

    reference_boxes_x1 = reference_boxes[:, 0].unsqueeze(1)
    reference_boxes_y1 = reference_boxes[:, 1].unsqueeze(1)
    reference_boxes_x2 = reference_boxes[:, 2].unsqueeze(1)
    reference_boxes_y2 = reference_boxes[:, 3].unsqueeze(1)

    # implementation starts here
    ex_widths = proposals_x2 - proposals_x1
    ex_heights = proposals_y2 - proposals_y1
    ex_ctr_x = proposals_x1 + 0.5 * ex_widths
    ex_ctr_y = proposals_y1 + 0.5 * ex_heights

    gt_widths = reference_boxes_x2 - reference_boxes_x1
    gt_heights = reference_boxes_y2 - reference_boxes_y1
    gt_ctr_x = reference_boxes_x1 + 0.5 * gt_widths
    gt_ctr_y = reference_boxes_y1 + 0.5 * gt_heights

    targets_dx = wx * (gt_ctr_x - ex_ctr_x) / ex_widths
    targets_dy = wy * (gt_ctr_y - ex_ctr_y) / ex_heights
    targets_dw = ww * torch.log(gt_widths / ex_widths)
    targets_dh = wh * torch.log(gt_heights / ex_heights)

    targets = torch.cat((targets_dx, targets_dy, targets_dw, targets_dh), dim=1)
    return targets

class BoxCoder:
    """
    This class encodes and decodes a set of bounding boxes into
    the representation used for training the regressors.
    """

    def __init__(
        self, weights: Tuple[float, float, float, float], bbox_xform_clip: float = math.log(1000.0 / 16)
    ) -> None:
        """
        Args:
            weights (4-element tuple)
            bbox_xform_clip (float)
        """
        # In practice 4 weights are used, so the regression target becomes offset * weights
        self.weights = weights
        self.bbox_xform_clip = bbox_xform_clip

    # Encoding: offset = box - anchor
    def encode_single(self, reference_boxes: Tensor, proposals: Tensor) -> Tensor:
        """
        Encode a set of proposals with respect to some
        reference boxes
        Args:
            reference_boxes (Tensor): reference boxes
            proposals (Tensor): boxes to be encoded
        """
        dtype = reference_boxes.dtype
        device = reference_boxes.device
        weights = torch.as_tensor(self.weights, dtype=dtype, device=device)
        targets = encode_boxes(reference_boxes, proposals, weights)

        return targets

    # Decoding: box = anchor + offset
    def decode_single(self, rel_codes: Tensor, boxes: Tensor) -> Tensor:
        """
        From a set of original boxes and encoded relative box offsets,
        get the decoded boxes.
        Args:
            rel_codes (Tensor): encoded boxes
            boxes (Tensor): reference boxes.
        """

        boxes = boxes.to(rel_codes.dtype)

        widths = boxes[:, 2] - boxes[:, 0]
        heights = boxes[:, 3] - boxes[:, 1]
        ctr_x = boxes[:, 0] + 0.5 * widths
        ctr_y = boxes[:, 1] + 0.5 * heights

        wx, wy, ww, wh = self.weights
        dx = rel_codes[:, 0::4] / wx
        dy = rel_codes[:, 1::4] / wy
        dw = rel_codes[:, 2::4] / ww
        dh = rel_codes[:, 3::4] / wh

        # Prevent sending too large values into torch.exp()
        dw = torch.clamp(dw, max=self.bbox_xform_clip)
        dh = torch.clamp(dh, max=self.bbox_xform_clip)

        pred_ctr_x = dx * widths[:, None] + ctr_x[:, None]
        pred_ctr_y = dy * heights[:, None] + ctr_y[:, None]
        pred_w = torch.exp(dw) * widths[:, None]
        pred_h = torch.exp(dh) * heights[:, None]

        # Distance from center to box's corner.
        c_to_c_h = torch.tensor(0.5, dtype=pred_ctr_y.dtype, device=pred_h.device) * pred_h
        c_to_c_w = torch.tensor(0.5, dtype=pred_ctr_x.dtype, device=pred_w.device) * pred_w

        pred_boxes1 = pred_ctr_x - c_to_c_w
        pred_boxes2 = pred_ctr_y - c_to_c_h
        pred_boxes3 = pred_ctr_x + c_to_c_w
        pred_boxes4 = pred_ctr_y + c_to_c_h
        pred_boxes = torch.stack((pred_boxes1, pred_boxes2, pred_boxes3, pred_boxes4), dim=2).flatten(1)
        return pred_boxes

box_coder = BoxCoder(weights=(10., 10., 5., 5.))
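A tiny round-trip check (a sketch assuming the BoxCoder above): encoding a ground-truth box against an anchor and then decoding the offsets recovers the original box.

anchor = torch.tensor([[50., 50., 150., 150.]])    # (x1, y1, x2, y2)
gt_box = torch.tensor([[60., 40., 170., 160.]])
offsets = box_coder.encode_single(gt_box, anchor)  # regression target for this anchor
decoded = box_coder.decode_single(offsets, anchor)
print(offsets)   # weighted (tx, ty, tw, th)
print(decoded)   # approximately [[60., 40., 170., 160.]]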

Matching Strategy

During training we first have to decide which anchors are responsible for each ground truth before any loss can be computed; this is the anchor matching strategy, also called label assignment in some papers. SSD's matching is IoU-based: first compute the IoU between all ground truths and all anchors, then for each anchor take the ground truth with the largest IoU (so each anchor predicts at most one ground truth); if that maximum IoU exceeds a threshold (0.5 in SSD), the anchor is matched to the ground truth and predicts it during training. Anchors matched to a ground truth are called positives; anchors matched to nothing are negatives and should be predicted as background. A ground truth may be matched by several anchors, but it may also be matched by none (when its IoU with every anchor is below the threshold); to prevent this, each ground truth is always matched by the anchor with which it has the largest IoU, regardless of the threshold, so every ground truth gets at least one anchor. SSD's matching differs a bit from Faster RCNN, which uses two thresholds (0.7, 0.3); anchors between the two thresholds are neither positive nor negative and are ignored in the loss. In the implementation, however, SSD's matcher can simply reuse the Faster RCNN logic, as shown below:

# Matching strategy of Faster RCNN and RetinaNet:
# 1. Compute the IoU between all gts and anchors
# 2. For each anchor, take the gt with the largest IoU
# 3. If that IoU is above the high threshold the anchor is matched to the gt; below the low
#    threshold it becomes a negative; between the two thresholds it is ignored
# 4. With allow_low_quality_matches, for each gt the anchor with the largest IoU is always a
#    positive, but that anchor is assigned to the gt with which IT has the largest IoU (not
#    necessarily the original gt). This looks a bit odd but is reasonable.
class Matcher:
    """
    This class assigns to each predicted "element" (e.g., a box) a ground-truth
    element. Each predicted element will have exactly zero or one matches; each
    ground-truth element may be assigned to zero or more predicted elements.
    Matching is based on the MxN match_quality_matrix, that characterizes how well
    each (ground-truth, predicted)-pair match. For example, if the elements are
    boxes, the matrix may contain box IoU overlap values.
    The matcher returns a tensor of size N containing the index of the ground-truth
    element m that matches to prediction n. If there is no match, a negative value
    is returned.
    """

    BELOW_LOW_THRESHOLD = -1
    BETWEEN_THRESHOLDS = -2

    __annotations__ = {
        "BELOW_LOW_THRESHOLD": int,
        "BETWEEN_THRESHOLDS": int,
    }

    def __init__(self, high_threshold: float, low_threshold: float, allow_low_quality_matches: bool = False) -> None:
        """
        Args:
            high_threshold (float): quality values greater than or equal to
                this value are candidate matches.
            low_threshold (float): a lower quality threshold used to stratify
                matches into three levels:
                1) matches >= high_threshold
                2) BETWEEN_THRESHOLDS matches in [low_threshold, high_threshold)
                3) BELOW_LOW_THRESHOLD matches in [0, low_threshold)
            allow_low_quality_matches (bool): if True, produce additional matches
                for predictions that have only low-quality match candidates. See
                set_low_quality_matches_ for more details.
        """
        self.BELOW_LOW_THRESHOLD = -1
        self.BETWEEN_THRESHOLDS = -2
        assert low_threshold <= high_threshold
        self.high_threshold = high_threshold
        self.low_threshold = low_threshold
        self.allow_low_quality_matches = allow_low_quality_matches

    def __call__(self, match_quality_matrix: Tensor) -> Tensor:
        """
        Args:
            match_quality_matrix (Tensor[float]): an MxN tensor, containing the
            pairwise quality between M ground-truth elements and N predicted elements.
        Returns:
            matches (Tensor[int64]): an N tensor where N[i] is a matched gt in
            [0, M - 1] or a negative value indicating that prediction i could not
            be matched.
        """
        if match_quality_matrix.numel() == 0:
            # empty targets or proposals not supported during training
            if match_quality_matrix.shape[0] == 0:
                raise ValueError("No ground-truth boxes available for one of the images during training")
            else:
                raise ValueError("No proposal boxes available for one of the images during training")

        # Here match_quality_matrix is the IoU between gts and anchors
        # match_quality_matrix is M (gt) x N (predicted)
        # Max over gt elements (dim 0) to find best gt candidate for each prediction
        # For each anchor, find the gt with the largest IoU
        matched_vals, matches = match_quality_matrix.max(dim=0)
        if self.allow_low_quality_matches:
            all_matches = matches.clone()
        else:
            all_matches = None  # type: ignore[assignment]

        # Decide from the IoU whether each anchor is a positive, a negative, or ignored
        # Assign candidate matches with low quality to negative (unassigned) values
        below_low_threshold = matched_vals < self.low_threshold
        between_thresholds = (matched_vals >= self.low_threshold) & (matched_vals < self.high_threshold)
        matches[below_low_threshold] = self.BELOW_LOW_THRESHOLD
        matches[between_thresholds] = self.BETWEEN_THRESHOLDS

        if self.allow_low_quality_matches:
            assert all_matches is not None
            self.set_low_quality_matches_(matches, all_matches, match_quality_matrix)

        return matches

    def set_low_quality_matches_(self, matches: Tensor, all_matches: Tensor, match_quality_matrix: Tensor) -> None:
        """
        Produce additional matches for predictions that have only low-quality matches.
        Specifically, for each ground-truth find the set of predictions that have
        maximum overlap with it (including ties); for each prediction in that set, if
        it is unmatched, then match it to the ground-truth with which it has the highest
        quality value.
        """
        # For each gt, find the prediction with which it has highest quality
        highest_quality_foreach_gt, _ = match_quality_matrix.max(dim=1)
        # Find highest quality match available, even if it is low, including ties
        gt_pred_pairs_of_highest_quality = torch.where(match_quality_matrix == highest_quality_foreach_gt[:, None])
        # Example gt_pred_pairs_of_highest_quality:
        #   tensor([[    0, 39796],
        #           [    1, 32055],
        #           [    1, 32070],
        #           [    2, 39190],
        #           [    2, 40255],
        #           [    3, 40390],
        #           [    3, 41455],
        #           [    4, 45470],
        #           [    5, 45325],
        #           [    5, 46390]])
        # Each row is a (gt index, prediction index)
        # Note how gt items 1, 2, 3, and 5 each have two ties

        pred_inds_to_update = gt_pred_pairs_of_highest_quality[1]
        matches[pred_inds_to_update] = all_matches[pred_inds_to_update]


class SSDMatcher(Matcher):
    def __init__(self, threshold: float) -> None:
        # High and low thresholds are set to the same value, so there are only positives and negatives
        super().__init__(threshold, threshold, allow_low_quality_matches=False)

    def __call__(self, match_quality_matrix: Tensor) -> Tensor:
        # For each anchor, find the gt with the largest IoU
        matches = super().__call__(match_quality_matrix)

        # For each gt, find the prediction with which it has the highest quality
        # i.e. for each gt, find the anchor with the largest IoU
        _, highest_quality_pred_foreach_gt = match_quality_matrix.max(dim=1)
        # Force these anchors to match their gt, regardless of how large the IoU is
        matches[highest_quality_pred_foreach_gt] = torch.arange(
            highest_quality_pred_foreach_gt.size(0), dtype=torch.int64, device=highest_quality_pred_foreach_gt.device
        )

        return matches
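A toy example (a sketch using the SSDMatcher above) with 2 ground truths and 3 anchors: anchor 0 matches gt 0 (IoU 0.6 >= 0.5), anchor 1 matches nothing (-1), and anchor 2 matches gt 1; anchor 2 is also gt 1's best anchor, so it stays a positive in any case.

iou = torch.tensor([[0.60, 0.20, 0.40],
                    [0.10, 0.30, 0.55]])   # rows: ground truths, columns: anchors
matcher = SSDMatcher(threshold=0.5)
print(matcher(iou))   # tensor([ 0, -1,  1])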

IoU-based matching is a hand-crafted rule. Its advantage is that it controls the quality of the positives; the downside is a loss of flexibility. Some recent works propose dynamic matching strategies that assign labels based on the model's predictions instead of fixed rules, such as DETR and YOLOX.

Loss Function

Once the matching strategy has decided whether each anchor is a positive or a negative, and which ground truth each positive anchor should predict, the training loss is easy to compute. The SSD loss has two parts: a classification loss and a regression loss. Classification uses softmax over the detection classes plus one background class; the target of a positive anchor is the class of its matched ground truth, and the target of a negative anchor is the background class. For positives a regression loss is also computed, using Smooth L1 loss. A well-known problem for one-stage detectors is the severe imbalance between positives and negatives during training; SSD handles it with hard negative mining (RetinaNet uses focal loss instead): the negatives are ranked by their classification loss and only the hardest top-k are kept, so that the positive-to-negative ratio stays around 1:3. The implementation is shown below:

    def compute_loss(
        self,
        targets: List[Dict[str, Tensor]],
        head_outputs: Dict[str, Tensor],
        anchors: List[Tensor],
        matched_idxs: List[Tensor],
    ) -> Dict[str, Tensor]:
        bbox_regression = head_outputs["bbox_regression"]
        cls_logits = head_outputs["cls_logits"]

        # Match original targets with default boxes
        num_foreground = 0
        bbox_loss = []
        cls_targets = []
        for (
            targets_per_image,
            bbox_regression_per_image,
            cls_logits_per_image,
            anchors_per_image,
            matched_idxs_per_image,
        ) in zip(targets, bbox_regression, cls_logits, anchors, matched_idxs):
            # Determine the positive anchors
            foreground_idxs_per_image = torch.where(matched_idxs_per_image >= 0)[0]
            foreground_matched_idxs_per_image = matched_idxs_per_image[foreground_idxs_per_image]
            num_foreground += foreground_matched_idxs_per_image.numel()

            # Compute the regression loss
            matched_gt_boxes_per_image = targets_per_image["boxes"][foreground_matched_idxs_per_image]
            bbox_regression_per_image = bbox_regression_per_image[foreground_idxs_per_image, :]
            anchors_per_image = anchors_per_image[foreground_idxs_per_image, :]
            target_regression = self.box_coder.encode_single(matched_gt_boxes_per_image, anchors_per_image)
            bbox_loss.append(
                torch.nn.functional.smooth_l1_loss(bbox_regression_per_image, target_regression, reduction="sum")
            )

            # Estimate ground truth for class targets
            gt_classes_target = torch.zeros(
                (cls_logits_per_image.size(0),),
                dtype=targets_per_image["labels"].dtype,
                device=targets_per_image["labels"].device,
            )
            gt_classes_target[foreground_idxs_per_image] = targets_per_image["labels"][
                foreground_matched_idxs_per_image
            ]
            cls_targets.append(gt_classes_target)

        bbox_loss = torch.stack(bbox_loss)
        cls_targets = torch.stack(cls_targets)

        # Compute the classification loss
        num_classes = cls_logits.size(-1)
        cls_loss = F.cross_entropy(cls_logits.view(-1, num_classes), cls_targets.view(-1), reduction="none").view(
            cls_targets.size()
        )

        # Hard Negative Sampling, done per image rather than over the whole batch
        foreground_idxs = cls_targets > 0
        num_negative = self.neg_to_pos_ratio * foreground_idxs.sum(1, keepdim=True)  # number of negatives to keep
        negative_loss = cls_loss.clone()
        negative_loss[foreground_idxs] = -float("inf")  # use -inf to detect positive values that creeped in the sample
        # Sort the negatives by descending loss
        values, idx = negative_loss.sort(1, descending=True)
        # Keep the top-k negatives
        background_idxs = idx.sort(1)[1] < num_negative
        N = max(1, num_foreground)  # number of positives
        return {
            "bbox_regression": bbox_loss.sum() / N,
            "classification": (cls_loss[foreground_idxs].sum() + cls_loss[background_idxs].sum()) / N,
        }
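The top-k selection above uses a slightly cryptic double-sort trick instead of topk. The toy illustration below (a sketch with a made-up loss tensor) shows what it does: idx.sort(1)[1] recovers each anchor's rank in the descending-loss ordering, so comparing the rank against num_negative keeps exactly the hardest negatives.

# idx.sort(1)[1] gives each anchor's rank in the descending-loss order,
# so "rank < k" selects exactly the k anchors with the largest loss.
loss = torch.tensor([[0.1, 0.9, 0.4, 0.7]])
_, idx = loss.sort(1, descending=True)   # idx  = [[1, 3, 2, 0]]
rank = idx.sort(1)[1]                    # rank = [[3, 0, 2, 1]]
print(rank < 2)                          # tensor([[False,  True, False,  True]])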

Data Augmentation

SSD uses fairly strong data augmentation, including horizontal flips, color distortion, random IoU-constrained crops, and zoom out (which shrinks objects).

transforms = T.Compose(
                [
                    T.RandomPhotometricDistort(),          # color distortion
                    T.RandomZoomOut(fill=list(mean)),      # random zoom out
                    T.RandomIoUCrop(),                     # random crop
                    T.RandomHorizontalFlip(p=hflip_prob),  # horizontal flip
                    T.PILToTensor(),
                    T.ConvertImageDtype(torch.float),
                ]
            )

The first two augmentations are easy to understand and implement, so this section focuses on the last two. RandomIoUCrop randomly crops a region from the image, constrained to a range of scales (relative to the original image) and aspect ratios. For the annotated boxes, a box is kept only if its center falls inside the cropped region; in addition, an IoU threshold is sampled, and the crop is only accepted if at least one box has an IoU with the cropped region above that threshold. This augmentation is rather involved and is easier to understand from the code:

class RandomIoUCrop(nn.Module):
    def __init__(
        self,
        min_scale: float = 0.3,
        max_scale: float = 1.0,
        min_aspect_ratio: float = 0.5,
        max_aspect_ratio: float = 2.0,
        sampler_options: Optional[List[float]] = None,
        trials: int = 40,
    ):
        super().__init__()
        # Configuration similar to https://github.com/weiliu89/caffe/blob/ssd/examples/ssd/ssd_coco.py#L89-L174
        self.min_scale = min_scale
        self.max_scale = max_scale
        self.min_aspect_ratio = min_aspect_ratio
        self.max_aspect_ratio = max_aspect_ratio
        if sampler_options is None:
            sampler_options = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
        self.options = sampler_options
        self.trials = trials

    def forward(
        self, image: Tensor, target: Optional[Dict[str, Tensor]] = None
    ) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
        if target is None:
            raise ValueError("The targets can't be None for this transform.")

        if isinstance(image, torch.Tensor):
            if image.ndimension() not in {2, 3}:
                raise ValueError(f"image should be 2/3 dimensional. Got {image.ndimension()} dimensions.")
            elif image.ndimension() == 2:
                image = image.unsqueeze(0)

        orig_w, orig_h = F.get_image_size(image)

        while True:
            # sample an option: randomly pick an IoU threshold
            idx = int(torch.randint(low=0, high=len(self.options), size=(1,)))
            min_jaccard_overlap = self.options[idx]
            if min_jaccard_overlap >= 1.0:  # a value larger than 1 encodes the leave as-is option
                return image, target

            # The crop has constraints, so try several times
            for _ in range(self.trials):
                # check the aspect ratio limitations
                # randomly pick the scales of w and h
                r = self.min_scale + (self.max_scale - self.min_scale) * torch.rand(2)
                new_w = int(orig_w * r[0])
                new_h = int(orig_h * r[1])
                aspect_ratio = new_w / new_h
                # check whether the resulting aspect ratio is within the allowed range
                if not (self.min_aspect_ratio <= aspect_ratio <= self.max_aspect_ratio):
                    continue

                # check for 0 area crops
                # randomly pick the top-left corner of the crop and check the area is not 0
                r = torch.rand(2)
                left = int((orig_w - new_w) * r[0])
                top = int((orig_h - new_h) * r[1])
                right = left + new_w
                bottom = top + new_h
                if left == right or top == bottom:
                    continue

                # check for any valid boxes with centers within the crop area
                # a box is kept only if its center falls inside the crop
                cx = 0.5 * (target["boxes"][:, 0] + target["boxes"][:, 2])
                cy = 0.5 * (target["boxes"][:, 1] + target["boxes"][:, 3])
                is_within_crop_area = (left < cx) & (cx < right) & (top < cy) & (cy < bottom)
                if not is_within_crop_area.any():
                    continue

                # check at least 1 box with jaccard limitations
                # check that at least one box has an IoU with the crop above the threshold
                boxes = target["boxes"][is_within_crop_area]
                ious = torchvision.ops.boxes.box_iou(
                    boxes, torch.tensor([[left, top, right, bottom]], dtype=boxes.dtype, device=boxes.device)
                )
                if ious.max() < min_jaccard_overlap:
                    continue

                # keep only valid boxes and perform cropping
                # keep the valid boxes and clip them to the crop
                target["boxes"] = boxes
                target["labels"] = target["labels"][is_within_crop_area]
                target["boxes"][:, 0::2] -= left
                target["boxes"][:, 1::2] -= top
                target["boxes"][:, 0::2].clamp_(min=0, max=new_w)
                target["boxes"][:, 1::2].clamp_(min=0, max=new_h)
                image = F.crop(image, top, left, new_h, new_w)

                return image, target

TensorFlow has a similar operation, tf.image.sample_distorted_bounding_box. Random cropping effectively zooms in on objects; zoom out does the opposite: it shrinks objects and thereby adds many training samples that contain small objects. It is implemented by creating a larger canvas and placing the image at a random position inside it:

class RandomZoomOut(nn.Module):
    def __init__(
        self, fill: Optional[List[float]] = None, side_range: Tuple[float, float] = (1.0, 4.0), p: float = 0.5
    ):
        super().__init__()
        if fill is None:
            fill = [0.0, 0.0, 0.0]
        self.fill = fill
        self.side_range = side_range
        if side_range[0] < 1.0 or side_range[0] > side_range[1]:
            raise ValueError(f"Invalid canvas side range provided {side_range}.")
        self.p = p

    @torch.jit.unused
    def _get_fill_value(self, is_pil):
        # type: (bool) -> int
        # We fake the type to make it work on JIT
        return tuple(int(x) for x in self.fill) if is_pil else 0

    def forward(
        self, image: Tensor, target: Optional[Dict[str, Tensor]] = None
    ) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
        if isinstance(image, torch.Tensor):
            if image.ndimension() not in {2, 3}:
                raise ValueError(f"image should be 2/3 dimensional. Got {image.ndimension()} dimensions.")
            elif image.ndimension() == 2:
                image = image.unsqueeze(0)

        if torch.rand(1) < self.p:
            return image, target

        orig_w, orig_h = F.get_image_size(image)

        # Randomly choose the canvas size
        r = self.side_range[0] + torch.rand(1) * (self.side_range[1] - self.side_range[0])
        canvas_width = int(orig_w * r)
        canvas_height = int(orig_h * r)

        # Randomly choose the top-left position of the image inside the canvas
        r = torch.rand(2)
        left = int((canvas_width - orig_w) * r[0])
        top = int((canvas_height - orig_h) * r[1])
        right = canvas_width - (left + orig_w)
        bottom = canvas_height - (top + orig_h)

        if torch.jit.is_scripting():
            fill = 0
        else:
            fill = self._get_fill_value(F._is_pil_image(image))

        # Pad the image to the canvas size
        image = F.pad(image, [left, top, right, bottom], fill=fill)

        # Shift the boxes accordingly
        if target is not None:
            target["boxes"][:, 0::2] += left
            target["boxes"][:, 1::2] += top

        return image, target

These augmentations greatly expand the training set and produce objects at many different scales, which is quite important for SSD's detection performance.

Training

torchvision also provides the training hyper-parameters used to reproduce SSD:

python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
    --dataset coco --model ssd300_vgg16 --epochs 120\
    --lr-steps 80 110 --aspect-ratio-group-factor 3 --lr 0.002 --batch-size 4\
    --weight-decay 0.0005 --data-augmentation ssd

Because of the strong data augmentation, SSD needs a long training schedule: 120 epochs (by contrast, detectors such as Faster RCNN and RetinaNet are typically trained for only 12 or 36 epochs). If you are interested in the training hyper-parameters, it is also worth comparing them with mmdetection's settings. In addition, the torchvision team published a breakdown of how much each optimization contributed when reproducing SSD (see Everything You Need To Know About Torchvision's SSD Implementation); in that comparison, "weight init" and "input scaling" refer to using the Caffe VGG16 weights and its input normalization.

Inference

SSD's inference procedure is straightforward: first filter out low-confidence predictions using the classification probabilities and a score threshold, then keep the top K predictions per class, and finally remove duplicates with NMS. The implementation is shown below:

    def postprocess_detections(
        self, head_outputs: Dict[str, Tensor], image_anchors: List[Tensor], image_shapes: List[Tuple[int, int]]
    ) -> List[Dict[str, Tensor]]:
        bbox_regression = head_outputs["bbox_regression"]
        pred_scores = F.softmax(head_outputs["cls_logits"], dim=-1)

        num_classes = pred_scores.size(-1)
        device = pred_scores.device

        detections: List[Dict[str, Tensor]] = []

        for boxes, scores, anchors, image_shape in zip(bbox_regression, pred_scores, image_anchors, image_shapes):
            boxes = self.box_coder.decode_single(boxes, anchors)  # decode the predicted boxes
            boxes = box_ops.clip_boxes_to_image(boxes, image_shape)  # clip boxes to the image

            image_boxes = []
            image_scores = []
            image_labels = []
            # For each class: filter low-confidence predictions and keep the top K
            for label in range(1, num_classes):
                score = scores[:, label]

                keep_idxs = score > self.score_thresh
                score = score[keep_idxs]
                box = boxes[keep_idxs]

                # keep only topk scoring predictions
                num_topk = min(self.topk_candidates, score.size(0))
                score, idxs = score.topk(num_topk)
                box = box[idxs]

                image_boxes.append(box)
                image_scores.append(score)
                image_labels.append(torch.full_like(score, fill_value=label, dtype=torch.int64, device=device))

            image_boxes = torch.cat(image_boxes, dim=0)
            image_scores = torch.cat(image_scores, dim=0)
            image_labels = torch.cat(image_labels, dim=0)

            # non-maximum suppression: remove duplicate boxes
            keep = box_ops.batched_nms(image_boxes, image_scores, image_labels, self.nms_thresh)
            keep = keep[: self.detections_per_img]

            detections.append(
                {
                    "boxes": image_boxes[keep],
                    "scores": image_scores[keep],
                    "labels": image_labels[keep],
                }
            )
        return detections

          SSDLite

SSDLite is a lightweight SSD designed by Google in the MobileNetV2 paper. Compared with the original SSD, its feature extractor is replaced by MobileNetV2 (or the newer MobileNetV3), and both the extra prediction branches and the detection heads use depthwise separable convolutions (depthwise 3x3 conv + 1x1 conv), which greatly reduces parameters and computation. torchvision now implements SSDLite as well and reproduces the MobileNetV3-Large variant (21.3 mAP, essentially matching the paper's 22.0). The implementation details are as follows.

For the MobileNet feature extractor, similar to VGG16, two scales of features are first taken from the MobileNet body: a 1/16 feature and a 1/32 feature. The 1/32 feature is simply the output of the last conv layer before global average pooling. The 1/16 feature is taken from inside the last stride=2 block; here a block means the inverted residual block introduced in MobileNetV2, which consists of 1x1 conv + depthwise 3x3 conv + 1x1 conv, where the first 1x1 conv is called the expansion layer and the last 1x1 conv the projection layer. In a stride=2 block the stride is applied to the middle depthwise 3x3 conv, so the deepest 1/16 feature is the output of the expansion layer of the last stride=2 block. Then 4 extra branches are appended so that, like SSD, 6 feature maps at different scales are used for detection; these extra branches also use depthwise separable convolutions (actually 1x1 conv + depthwise 3x3 conv + 1x1 conv). The implementation is shown below:

# The extra branches use 1x1 conv + depthwise 3x3 s2 conv + 1x1 conv
def _extra_block(in_channels: int, out_channels: int, norm_layer: Callable[..., nn.Module]) -> nn.Sequential:
    activation = nn.ReLU6
    intermediate_channels = out_channels // 2
    return nn.Sequential(
        # 1x1 projection to half output channels
        ConvNormActivation(
            in_channels, intermediate_channels, kernel_size=1, norm_layer=norm_layer, activation_layer=activation
        ),
        # 3x3 depthwise with stride 2 and padding 1
        ConvNormActivation(
            intermediate_channels,
            intermediate_channels,
            kernel_size=3,
            stride=2,
            groups=intermediate_channels,
            norm_layer=norm_layer,
            activation_layer=activation,
        ),
        # 1x1 projection to output channels
        ConvNormActivation(
            intermediate_channels, out_channels, kernel_size=1, norm_layer=norm_layer, activation_layer=activation
        ),
    )

class SSDLiteFeatureExtractorMobileNet(nn.Module):
    def __init__(
        self,
        backbone: nn.Module,
        c4_pos: int,
        norm_layer: Callable[..., nn.Module],
        width_mult: float = 1.0,
        min_depth: int = 16,
    ):
        super().__init__()
        _log_api_usage_once(self)

        assert not backbone[c4_pos].use_res_connect
        self.features = nn.Sequential(
            # As described in section 6.3 of MobileNetV3 paper
            nn.Sequential(*backbone[:c4_pos], backbone[c4_pos].block[0]),  # modules before the last s=2 block, plus its expansion layer
            nn.Sequential(backbone[c4_pos].block[1:], *backbone[c4_pos + 1 :]),  # the remaining modules up to the conv before pooling
        )

        # The 4 extra branches
        get_depth = lambda d: max(min_depth, int(d * width_mult))  # noqa: E731
        extra = nn.ModuleList(
            [
                _extra_block(backbone[-1].out_channels, get_depth(512), norm_layer),
                _extra_block(get_depth(512), get_depth(256), norm_layer),
                _extra_block(get_depth(256), get_depth(256), norm_layer),
                _extra_block(get_depth(256), get_depth(128), norm_layer),
            ]
        )
        _normal_init(extra)

        self.extra = extra

    def forward(self, x: Tensor) -> Dict[str, Tensor]:
        # Get feature maps from backbone and extra. Can't be refactored due to JIT limitations.
        output = []
        for block in self.features:
            x = block(x)
            output.append(x)

        for block in self.extra:
            x = block(x)
            output.append(x)

        return OrderedDict([(str(i), v) for i, v in enumerate(output)])

The detection head likewise uses depthwise separable convolutions:

# Building blocks of SSDlite as described in section 6.2 of MobileNetV2 paper
def _prediction_block(
    in_channels: int, out_channels: int, kernel_size: int, norm_layer: Callable[..., nn.Module]
) -> nn.Sequential:
    return nn.Sequential(
        # 3x3 depthwise with stride 1 and padding 1
        ConvNormActivation(
            in_channels,
            in_channels,
            kernel_size=kernel_size,
            groups=in_channels,
            norm_layer=norm_layer,
            activation_layer=nn.ReLU6,
        ),
        # 1x1 projection to output channels
        nn.Conv2d(in_channels, out_channels, 1),
    )

class SSDLiteClassificationHead(SSDScoringHead):
    def __init__(
        self, in_channels: List[int], num_anchors: List[int], num_classes: int, norm_layer: Callable[..., nn.Module]
    ):
        cls_logits = nn.ModuleList()
        for channels, anchors in zip(in_channels, num_anchors):
            cls_logits.append(_prediction_block(channels, num_classes * anchors, 3, norm_layer))
        _normal_init(cls_logits)
        super().__init__(cls_logits, num_classes)


class SSDLiteRegressionHead(SSDScoringHead):
    def __init__(self, in_channels: List[int], num_anchors: List[int], norm_layer: Callable[..., nn.Module]):
        bbox_reg = nn.ModuleList()
        for channels, anchors in zip(in_channels, num_anchors):
            bbox_reg.append(_prediction_block(channels, 4 * anchors, 3, norm_layer))
        _normal_init(bbox_reg)
        super().__init__(bbox_reg, 4)

One extra detail of the feature extractor: in the MobileNetV3 paper Google further halves the channels of the modules between C4 and C5, i.e. the modules after the last stride=2 block; the torchvision implementation controls this behavior with a boolean reduce_tail parameter. SSDLite uses a 320x320 input, and its default boxes also differ slightly from SSD's: all 6 feature maps use 6 default boxes per location, with the scales generated by the linear rule:

# 2 default boxes + 4 boxes with aspect ratio in [2, 1/2, 3, 1/3] per location
anchor_generator = DefaultBoxGenerator([[2, 3] for _ in range(6)], min_ratio=0.2, max_ratio=0.95)
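As a quick check of the linear rule (a sketch assuming the DefaultBoxGenerator shown earlier), min_ratio=0.2 and max_ratio=0.95 over 6 feature maps give evenly spaced scales, and every level gets 6 default boxes per location:

print(anchor_generator.scales)
# approximately [0.2, 0.35, 0.5, 0.65, 0.8, 0.95, 1.0]
print(anchor_generator.num_anchors_per_location())
# [6, 6, 6, 6, 6, 6]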

Since SSDLite has far fewer parameters than SSD, the data augmentation is also simplified to avoid underfitting (color jitter and zoom out are removed; lighter augmentation is generally recommended for small models, and YOLOX does the same):

transforms = T.Compose(
                [
                    T.RandomIoUCrop(),
                    T.RandomHorizontalFlip(p=hflip_prob),
                    T.PILToTensor(),
                    T.ConvertImageDtype(torch.float),
                ]
            )

The torchvision team describes their reproduction details in the blog post Everything You Need To Know About Torchvision's SSDlite Implementation, including how much each optimization improved the model. SSDLite's training hyper-parameters also differ slightly from SSD's, most notably a much longer schedule (660 epochs):

torchrun --nproc_per_node=8 train.py\
    --dataset coco --model ssdlite320_mobilenet_v3_large --epochs 660\
    --aspect-ratio-group-factor 3 --lr-scheduler cosineannealinglr --lr 0.15 --batch-size 24\
    --weight-decay 0.00004 --data-augmentation ssdlite

One more note: in Google's TensorFlow Object Detection implementation of SSDLite, the classification branch uses RetinaNet's focal loss rather than the original SSD hard negative mining + softmax loss; if you look closely at the model zoo there, you will also notice that SSD ResNet50 FPN and RetinaNet50 are the same thing under different names.

Finally, a fairly clean and complete SSD implementation is also available here: https://github.com/xiaohu2015/ssd_pytorch.
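For completeness, here is a minimal usage sketch (assuming torchvision >= 0.10, where ssd300_vgg16 accepts a pretrained flag) that loads the pretrained SSD300-VGG16 model discussed in this post and runs inference on a dummy image:

import torch
import torchvision

model = torchvision.models.detection.ssd300_vgg16(pretrained=True)
model.eval()

image = torch.rand(3, 500, 375)  # a dummy image; replace with a real tensor scaled to [0, 1]
with torch.no_grad():
    prediction = model([image])[0]
print(prediction["boxes"].shape, prediction["labels"][:5], prediction["scores"][:5])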

References

          • SSD: Single Shot MultiBox Detector
          • SSD Slide
          • Everything You Need To Know About Torchvision's SSD Implementation
          • Everything You Need To Know About Torchvision's SSDlite Implementation
          • MobileNetV2: Inverted Residuals and Linear Bottlenecks
          • Searching for MobileNetV3

