A Detailed Walkthrough of the torchvision Implementation of SSD
點藍色字關注“機器學習算法工程師”
設為星標,干貨直達!
An earlier article, "目标检测算法之SSD", already covered the principles of the SSD detector and an implementation, but it only provided inference code. This updated version walks through every part of SSD from the implementation point of view, based on the torchvision version, including data augmentation and training.
Backbone Feature Extractor
SSD uses VGG16 as its backbone. The main structure of the SSD300 network is described below.
SSD detects objects from multi-scale features, so the VGG16 backbone has to be modified and extended with a few extra modules. The VGG16 body contains 5 maxpool layers, each halving the feature map size, so it can be viewed as 5 stages built from 3x3 conv layers; for example, the last stage contains 3 conv layers, denoted Conv5_1, Conv5_2 and Conv5_3 (5 is the stage index, the trailing number is the conv layer index within the stage). Conv4_3 is the output of the 3rd conv layer of the 4th stage (the layer right before the 4th maxpool); its feature map size is 38x38 (300/2^3, rounded up via ceil_mode). This is the first feature map used for detection; since it sits relatively early in the network its activations tend to have large norms, so an extra L2 Normalization layer is added on top of it. Compared with the original VGG16, the 5th maxpool is changed from 2x2-s2 to 3x3-s1, so the feature map stays at 19x19 (no downsampling), and the fully-connected layers fc6 and fc7 are converted into two conv layers: a 3x3 Conv6 and a 1x1 Conv7, where Conv6 uses dilated (atrous) convolution with dilation=6. Conv7 provides the 2nd detection feature map, of size 19x19. On top of that, SSD appends 4 extra modules, each consisting of a 1x1 conv followed by a 3x3 conv whose output is used for detection; these outputs are denoted Conv8_2, Conv9_2, Conv10_2 and Conv11_2, with feature map sizes of 10x10, 5x5, 3x3 and 1x1 respectively. SSD512 takes a larger input image, so one more module (Conv12_2) is added. The feature extractor is implemented as follows:
from collections import OrderedDict
from typing import Dict, List, Optional, Tuple

import torch
import torch.nn.functional as F
from torch import nn, Tensor


class SSDFeatureExtractorVGG(nn.Module):
    def __init__(self, backbone: nn.Module, highres: bool):
        super().__init__()

        # locate maxpool3 and maxpool4 (backbone is the feature part of a VGG16 model)
        _, _, maxpool3_pos, maxpool4_pos, _ = (i for i, layer in enumerate(backbone) if isinstance(layer, nn.MaxPool2d))
        # enable ceil_mode on maxpool3 so the feature map is 38x38 instead of 37x37
        backbone[maxpool3_pos].ceil_mode = True
        # L2 normalization + rescaling weights for Conv4_3
        self.scale_weight = nn.Parameter(torch.ones(512) * 20)
        # layers up to Conv4_3, used to extract the first feature map
        self.features = nn.Sequential(
            *backbone[:maxpool4_pos]
        )
        # the 4 extra modules
        extra = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(1024, 256, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, 512, kernel_size=3, padding=1, stride=2),  # conv8_2
                nn.ReLU(inplace=True),
            ),
            nn.Sequential(
                nn.Conv2d(512, 128, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, 256, kernel_size=3, padding=1, stride=2),  # conv9_2
                nn.ReLU(inplace=True),
            ),
            nn.Sequential(
                nn.Conv2d(256, 128, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, 256, kernel_size=3),  # conv10_2
                nn.ReLU(inplace=True),
            ),
            nn.Sequential(
                nn.Conv2d(256, 128, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, 256, kernel_size=3),  # conv11_2
                nn.ReLU(inplace=True),
            )
        ])
        if highres:
            # SSD512 adds one more module
            extra.append(nn.Sequential(
                nn.Conv2d(256, 128, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, 256, kernel_size=4),  # conv12_2
                nn.ReLU(inplace=True),
            ))
        _xavier_init(extra)

        # maxpool5 + Conv6 (fc6) + Conv7 (fc7); initialized randomly here, the fc weights are not converted
        fc = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1, ceil_mode=False),  # add modified maxpool5
            nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, padding=6, dilation=6),  # FC6 with atrous
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=1),  # FC7
            nn.ReLU(inplace=True)
        )
        _xavier_init(fc)
        # insert at position 0 the block that produces the 2nd feature map (Conv5_x + the modified fc layers)
        extra.insert(0, nn.Sequential(
            *backbone[maxpool4_pos:-1],  # until conv5_3, skip maxpool5
            fc,
        ))
        self.extra = extra

    def forward(self, x: Tensor) -> Dict[str, Tensor]:
        # Conv4_3
        x = self.features(x)
        rescaled = self.scale_weight.view(1, -1, 1, 1) * F.normalize(x)
        output = [rescaled]
        # compute the remaining feature maps: Conv7, Conv8_2, Conv9_2, Conv10_2, Conv11_2, (Conv12_2)
        for block in self.extra:
            x = block(x)
            output.append(x)
        return OrderedDict([(str(i), v) for i, v in enumerate(output)])
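To see the six feature maps this produces, here is a minimal usage sketch (my addition, not part of the original code): it assumes the class above with its imports, builds a plain torchvision VGG16 backbone, and also includes a small _xavier_init helper playing the role of torchvision's internal one, which the class relies on.
import torch
from torch import nn
from torchvision.models import vgg16

def _xavier_init(conv: nn.Module):
    # xavier-initialize every conv layer inside the extra/fc blocks (same role as torchvision's helper)
    for layer in conv.modules():
        if isinstance(layer, nn.Conv2d):
            nn.init.xavier_uniform_(layer.weight)
            if layer.bias is not None:
                nn.init.constant_(layer.bias, 0.0)

backbone = vgg16().features                                  # plain VGG16 convolutional layers
extractor = SSDFeatureExtractorVGG(backbone, highres=False)  # SSD300 setup

feats = extractor(torch.randn(1, 3, 300, 300))
for name, f in feats.items():
    print(name, tuple(f.shape))
# expected spatial sizes: 38x38, 19x19, 10x10, 5x5, 3x3, 1x1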
Detecting from multiple scales is a key design of SSD: multi-scale features adapt to objects of different sizes. Since FPN was proposed, however, most detectors extract multi-scale features with an FPN-style neck which, unlike SSD, also fuses features across scales.
Detection Head
SSD's detection head is simple: a single 3x3 conv on top of each feature map. Its output channels are A * (C + 4), where A is the number of default boxes per location and C is the number of classes (note that it includes the background class, i.e. C = num_classes + 1); besides the class scores, the head also predicts the 4 offsets of the box relative to the default box. The implementation is shown below (classification and regression are split into two heads here, which is equivalent to the paper):
# Base class: reshapes each 4D conv output into the final prediction format (N, H*W*A, K)
class SSDScoringHead(nn.Module):
    def __init__(self, module_list: nn.ModuleList, num_columns: int):
        super().__init__()
        self.module_list = module_list
        self.num_columns = num_columns

    def _get_result_from_module_list(self, x: Tensor, idx: int) -> Tensor:
        """
        This is equivalent to self.module_list[idx](x),
        but torchscript doesn't support this yet
        """
        num_blocks = len(self.module_list)
        if idx < 0:
            idx += num_blocks
        out = x
        for i, module in enumerate(self.module_list):
            if i == idx:
                out = module(x)
        return out

    def forward(self, x: List[Tensor]) -> Tensor:
        all_results = []
        for i, features in enumerate(x):
            results = self._get_result_from_module_list(features, i)
            # Permute output from (N, A * K, H, W) to (N, HWA, K).
            N, _, H, W = results.shape
            results = results.view(N, -1, self.num_columns, H, W)
            results = results.permute(0, 3, 4, 1, 2)
            results = results.reshape(N, -1, self.num_columns)  # Size=(N, HWA, K)
            all_results.append(results)
        return torch.cat(all_results, dim=1)


# classification head
class SSDClassificationHead(SSDScoringHead):
    def __init__(self, in_channels: List[int], num_anchors: List[int], num_classes: int):
        cls_logits = nn.ModuleList()
        for channels, anchors in zip(in_channels, num_anchors):
            cls_logits.append(nn.Conv2d(channels, num_classes * anchors, kernel_size=3, padding=1))
        _xavier_init(cls_logits)
        super().__init__(cls_logits, num_classes)


# box regression head
class SSDRegressionHead(SSDScoringHead):
    def __init__(self, in_channels: List[int], num_anchors: List[int]):
        bbox_reg = nn.ModuleList()
        for channels, anchors in zip(in_channels, num_anchors):
            bbox_reg.append(nn.Conv2d(channels, 4 * anchors, kernel_size=3, padding=1))
        _xavier_init(bbox_reg)
        super().__init__(bbox_reg, 4)


class SSDHead(nn.Module):
    def __init__(self, in_channels: List[int], num_anchors: List[int], num_classes: int):
        super().__init__()
        self.classification_head = SSDClassificationHead(in_channels, num_anchors, num_classes)
        self.regression_head = SSDRegressionHead(in_channels, num_anchors)

    def forward(self, x: List[Tensor]) -> Dict[str, Tensor]:
        return {
            "bbox_regression": self.regression_head(x),
            "cls_logits": self.classification_head(x),
        }
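A quick shape check of the head (my addition, assuming the classes above and the _xavier_init helper from the earlier sketch), using the SSD300 channel and anchor configuration; 8732 is the familiar total number of default boxes:
import torch

# SSD300: channels of the 6 feature maps and default boxes per location on each of them
in_channels = [512, 1024, 512, 256, 256, 256]
num_anchors = [4, 6, 6, 6, 4, 4]
head = SSDHead(in_channels, num_anchors, num_classes=91)  # torchvision's COCO convention, background included

sizes = [38, 19, 10, 5, 3, 1]
feats = [torch.randn(2, c, s, s) for c, s in zip(in_channels, sizes)]
out = head(feats)
print(out["cls_logits"].shape)       # torch.Size([2, 8732, 91])
print(out["bbox_regression"].shape)  # torch.Size([2, 8732, 4])
# 8732 = 38*38*4 + 19*19*6 + 10*10*6 + 5*5*6 + 3*3*4 + 1*1*4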
Note that the head is not shared across feature maps, which makes sense given the scale differences between them. By contrast, RetinaNet's head uses 4 intermediate conv layers plus 1 prediction conv layer, is shared across all feature maps, and uses separate heads for classification and regression. A heavier head certainly helps accuracy, but it also increases the computation cost.
Default Boxes
SSD is an anchor-based one-stage detector; the paper calls its anchors default boxes. As noted above, SSD300 extracts 6 feature maps of sizes 38x38, 19x19, 10x10, 5x5, 3x3 and 1x1. Every location of a feature map uses the same set of anchors, but different feature maps use different sets: the smaller the feature map, the larger the anchor scale (the paper uses a linear rule to compute the anchor scale for each feature map). An anchor is centered at the center of a feature-map cell, and its shape is controlled by two parameters: scale and aspect_ratio. Each feature map uses anchors with the same scale but different aspect ratios. Denote by s_k the anchor scale on the k-th feature map and by a_r an anchor's aspect ratio; the anchor width and height are then w_k = s_k * sqrt(a_r) and h_k = s_k / sqrt(a_r). Concretely, the 6 feature maps use scales 0.07, 0.15, 0.33, 0.51, 0.69 and 0.87 (plus an extra 1.05 that only serves as s_{k+1} for the last map); these scales are relative to the image size, not absolute pixel sizes. Every feature map contains two special anchors: one with aspect_ratio=1 and scale s_k, and one with aspect_ratio=1 and scale s'_k = sqrt(s_k * s_{k+1}). Besides these two, the feature maps add anchors with the following aspect ratios: [2, 1/2], [2, 1/2, 3, 1/3], [2, 1/2, 3, 1/3], [2, 1/2, 3, 1/3], [2, 1/2], [2, 1/2], all with scale s_k. The implementation is shown below:
import math

from torchvision.models.detection.image_list import ImageList


class DefaultBoxGenerator(nn.Module):
    """
    This module generates the default boxes of SSD for a set of feature maps and image sizes.
    Args:
        aspect_ratios (List[List[int]]): A list with all the aspect ratios used in each feature map.
        min_ratio (float): The minimum scale :math:`\text{s}_{\text{min}}` of the default boxes used in the estimation
            of the scales of each feature map. It is used only if the ``scales`` parameter is not provided.
        max_ratio (float): The maximum scale :math:`\text{s}_{\text{max}}` of the default boxes used in the estimation
            of the scales of each feature map. It is used only if the ``scales`` parameter is not provided.
        scales (List[float]], optional): The scales of the default boxes. If not provided it will be estimated using
            the ``min_ratio`` and ``max_ratio`` parameters.
        steps (List[int]], optional): It's a hyper-parameter that affects the tiling of default boxes. If not provided
            it will be estimated from the data.
        clip (bool): Whether the standardized values of default boxes should be clipped between 0 and 1. The clipping
            is applied while the boxes are encoded in format ``(cx, cy, w, h)``.
    """

    def __init__(self, aspect_ratios: List[List[int]], min_ratio: float = 0.15, max_ratio: float = 0.9,
                 scales: Optional[List[float]] = None, steps: Optional[List[int]] = None, clip: bool = True):
        super().__init__()
        if steps is not None:
            assert len(aspect_ratios) == len(steps)
        self.aspect_ratios = aspect_ratios
        self.steps = steps
        self.clip = clip
        num_outputs = len(aspect_ratios)

        # if scales are not provided, estimate the anchor scale of each feature map with the linear rule
        if scales is None:
            if num_outputs > 1:
                range_ratio = max_ratio - min_ratio
                self.scales = [min_ratio + range_ratio * k / (num_outputs - 1.0) for k in range(num_outputs)]
                self.scales.append(1.0)
            else:
                self.scales = [min_ratio, max_ratio]
        else:
            self.scales = scales

        self._wh_pairs = self._generate_wh_pairs(num_outputs)

    def _generate_wh_pairs(self, num_outputs: int, dtype: torch.dtype = torch.float32,
                           device: torch.device = torch.device("cpu")) -> List[Tensor]:
        _wh_pairs: List[Tensor] = []
        for k in range(num_outputs):
            # the 2 default anchors: scale s_k and scale s'_k = sqrt(s_k * s_{k+1}), both with aspect ratio 1
            s_k = self.scales[k]
            s_prime_k = math.sqrt(self.scales[k] * self.scales[k + 1])
            wh_pairs = [[s_k, s_k], [s_prime_k, s_prime_k]]
            # each aspect ratio produces a pair of anchors (ar and 1/ar)
            for ar in self.aspect_ratios[k]:
                sq_ar = math.sqrt(ar)
                w = self.scales[k] * sq_ar
                h = self.scales[k] / sq_ar
                wh_pairs.extend([[w, h], [h, w]])
            _wh_pairs.append(torch.as_tensor(wh_pairs, dtype=dtype, device=device))
        return _wh_pairs

    def num_anchors_per_location(self):
        # number of anchors per location: 2 + 2 * len(aspect_ratios).
        return [2 + 2 * len(r) for r in self.aspect_ratios]

    # Default Boxes calculation based on page 6 of SSD paper
    def _grid_default_boxes(self, grid_sizes: List[List[int]], image_size: List[int],
                            dtype: torch.dtype = torch.float32) -> Tensor:
        default_boxes = []
        for k, f_k in enumerate(grid_sizes):
            # Now add the default boxes for each width-height pair
            if self.steps is not None:  # step is the number of image pixels covered by one feature-map cell
                x_f_k, y_f_k = [img_shape / self.steps[k] for img_shape in image_size]
            else:
                y_f_k, x_f_k = f_k

            # compute the anchor centers
            shifts_x = ((torch.arange(0, f_k[1]) + 0.5) / x_f_k).to(dtype=dtype)
            shifts_y = ((torch.arange(0, f_k[0]) + 0.5) / y_f_k).to(dtype=dtype)
            shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
            shift_x = shift_x.reshape(-1)
            shift_y = shift_y.reshape(-1)
            shifts = torch.stack((shift_x, shift_y) * len(self._wh_pairs[k]), dim=-1).reshape(-1, 2)
            # Clipping the default boxes while the boxes are encoded in format (cx, cy, w, h)
            _wh_pair = self._wh_pairs[k].clamp(min=0, max=1) if self.clip else self._wh_pairs[k]
            wh_pairs = _wh_pair.repeat((f_k[0] * f_k[1]), 1)
            default_box = torch.cat((shifts, wh_pairs), dim=1)
            default_boxes.append(default_box)
        return torch.cat(default_boxes, dim=0)

    def forward(self, image_list: ImageList, feature_maps: List[Tensor]) -> List[Tensor]:
        # images in a batch share the same size, so the default boxes are computed only once
        grid_sizes = [feature_map.shape[-2:] for feature_map in feature_maps]
        image_size = image_list.tensors.shape[-2:]
        dtype, device = feature_maps[0].dtype, feature_maps[0].device
        default_boxes = self._grid_default_boxes(grid_sizes, image_size, dtype=dtype)
        default_boxes = default_boxes.to(device)
        dboxes = []
        for _ in image_list.image_sizes:
            dboxes_in_image = default_boxes
            # (cx, cy, w, h) -> (x1, y1, x2, y2)
            dboxes_in_image = torch.cat([dboxes_in_image[:, :2] - 0.5 * dboxes_in_image[:, 2:],
                                         dboxes_in_image[:, :2] + 0.5 * dboxes_in_image[:, 2:]], -1)
            dboxes_in_image[:, 0::2] *= image_size[1]  # multiply by the image size to get absolute coordinates
            dboxes_in_image[:, 1::2] *= image_size[0]
            dboxes.append(dboxes_in_image)
        return dboxes


# the anchor configuration of SSD300
anchor_generator = DefaultBoxGenerator(
    [[2], [2, 3], [2, 3], [2, 3], [2], [2]],
    scales=[0.07, 0.15, 0.33, 0.51, 0.69, 0.87, 1.05],
    steps=[8, 16, 32, 64, 100, 300],
)
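With this configuration, num_anchors_per_location() gives [4, 6, 6, 6, 4, 4], which together with the feature map sizes yields the well-known total of 8732 default boxes for SSD300. A small sanity-check sketch (my addition, assuming the generator above):
print(anchor_generator.num_anchors_per_location())   # [4, 6, 6, 6, 4, 4]

sizes = [38, 19, 10, 5, 3, 1]
total = sum(s * s * a for s, a in zip(sizes, anchor_generator.num_anchors_per_location()))
print(total)   # 8732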
As mentioned above, SSD regresses 4 values per anchor, which are the offsets of the box relative to the anchor. This involves a transform between box and anchor, usually called the box encoding; SSD uses the same encoding as Faster RCNN. Concretely, let the anchor center and size be (x_a, y_a, w_a, h_a) and the ground-truth box center and size be (x, y, w, h); the 4 offsets are t_x = (x - x_a) / w_a, t_y = (y - y_a) / h_a, t_w = log(w / w_a), t_h = log(h / h_a). These 4 values are the regression targets; at inference time the inverse transform recovers the predicted box. Going from box to offsets is usually called encoding, and going from offsets to box decoding. The implementation is shown below:
def encode_boxes(reference_boxes: Tensor, proposals: Tensor, weights: Tensor) -> Tensor:
    """
    Encode a set of proposals with respect to some
    reference boxes
    Args:
        reference_boxes (Tensor): reference boxes
        proposals (Tensor): boxes to be encoded
        weights (Tensor[4]): the weights for ``(x, y, w, h)``
    """
    # perform some unpacking to make it JIT-fusion friendly
    wx = weights[0]
    wy = weights[1]
    ww = weights[2]
    wh = weights[3]
    proposals_x1 = proposals[:, 0].unsqueeze(1)
    proposals_y1 = proposals[:, 1].unsqueeze(1)
    proposals_x2 = proposals[:, 2].unsqueeze(1)
    proposals_y2 = proposals[:, 3].unsqueeze(1)
    reference_boxes_x1 = reference_boxes[:, 0].unsqueeze(1)
    reference_boxes_y1 = reference_boxes[:, 1].unsqueeze(1)
    reference_boxes_x2 = reference_boxes[:, 2].unsqueeze(1)
    reference_boxes_y2 = reference_boxes[:, 3].unsqueeze(1)
    # implementation starts here
    ex_widths = proposals_x2 - proposals_x1
    ex_heights = proposals_y2 - proposals_y1
    ex_ctr_x = proposals_x1 + 0.5 * ex_widths
    ex_ctr_y = proposals_y1 + 0.5 * ex_heights
    gt_widths = reference_boxes_x2 - reference_boxes_x1
    gt_heights = reference_boxes_y2 - reference_boxes_y1
    gt_ctr_x = reference_boxes_x1 + 0.5 * gt_widths
    gt_ctr_y = reference_boxes_y1 + 0.5 * gt_heights
    targets_dx = wx * (gt_ctr_x - ex_ctr_x) / ex_widths
    targets_dy = wy * (gt_ctr_y - ex_ctr_y) / ex_heights
    targets_dw = ww * torch.log(gt_widths / ex_widths)
    targets_dh = wh * torch.log(gt_heights / ex_heights)
    targets = torch.cat((targets_dx, targets_dy, targets_dw, targets_dh), dim=1)
    return targets


class BoxCoder:
    """
    This class encodes and decodes a set of bounding boxes into
    the representation used for training the regressors.
    """

    def __init__(
        self, weights: Tuple[float, float, float, float], bbox_xform_clip: float = math.log(1000.0 / 16)
    ) -> None:
        """
        Args:
            weights (4-element tuple)
            bbox_xform_clip (float)
        """
        # 4 weights are applied in practice, so the regression target becomes offset * weights
        self.weights = weights
        self.bbox_xform_clip = bbox_xform_clip

    # encoding: compute the offsets of the boxes relative to the anchors
    def encode_single(self, reference_boxes: Tensor, proposals: Tensor) -> Tensor:
        """
        Encode a set of proposals with respect to some
        reference boxes
        Args:
            reference_boxes (Tensor): reference boxes
            proposals (Tensor): boxes to be encoded
        """
        dtype = reference_boxes.dtype
        device = reference_boxes.device
        weights = torch.as_tensor(self.weights, dtype=dtype, device=device)
        targets = encode_boxes(reference_boxes, proposals, weights)
        return targets

    # decoding: apply the predicted offsets to the anchors to recover the boxes
    def decode_single(self, rel_codes: Tensor, boxes: Tensor) -> Tensor:
        """
        From a set of original boxes and encoded relative box offsets,
        get the decoded boxes.
        Args:
            rel_codes (Tensor): encoded boxes
            boxes (Tensor): reference boxes.
        """
        boxes = boxes.to(rel_codes.dtype)
        widths = boxes[:, 2] - boxes[:, 0]
        heights = boxes[:, 3] - boxes[:, 1]
        ctr_x = boxes[:, 0] + 0.5 * widths
        ctr_y = boxes[:, 1] + 0.5 * heights
        wx, wy, ww, wh = self.weights
        dx = rel_codes[:, 0::4] / wx
        dy = rel_codes[:, 1::4] / wy
        dw = rel_codes[:, 2::4] / ww
        dh = rel_codes[:, 3::4] / wh
        # Prevent sending too large values into torch.exp()
        dw = torch.clamp(dw, max=self.bbox_xform_clip)
        dh = torch.clamp(dh, max=self.bbox_xform_clip)
        pred_ctr_x = dx * widths[:, None] + ctr_x[:, None]
        pred_ctr_y = dy * heights[:, None] + ctr_y[:, None]
        pred_w = torch.exp(dw) * widths[:, None]
        pred_h = torch.exp(dh) * heights[:, None]
        # Distance from center to box's corner.
        c_to_c_h = torch.tensor(0.5, dtype=pred_ctr_y.dtype, device=pred_h.device) * pred_h
        c_to_c_w = torch.tensor(0.5, dtype=pred_ctr_x.dtype, device=pred_w.device) * pred_w
        pred_boxes1 = pred_ctr_x - c_to_c_w
        pred_boxes2 = pred_ctr_y - c_to_c_h
        pred_boxes3 = pred_ctr_x + c_to_c_w
        pred_boxes4 = pred_ctr_y + c_to_c_h
        pred_boxes = torch.stack((pred_boxes1, pred_boxes2, pred_boxes3, pred_boxes4), dim=2).flatten(1)
        return pred_boxes


box_coder = BoxCoder(weights=(10., 10., 5., 5.))
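A tiny round-trip check (my addition) makes the encode/decode convention concrete: encode a ground-truth box against an anchor, then decode the offsets back.
import torch

anchor = torch.tensor([[100., 100., 200., 200.]])   # (x1, y1, x2, y2)
gt_box = torch.tensor([[110., 90., 230., 210.]])

offsets = box_coder.encode_single(gt_box, anchor)    # reference_boxes = gt, proposals = anchor
print(offsets)                                       # weighted (tx, ty, tw, th)
print(box_coder.decode_single(offsets, anchor))      # recovers gt_box (up to float error)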
Matching Strategy
During training we first have to decide which anchors are responsible for each ground truth before the loss can be computed; this is the anchor matching strategy, also called label assignment in some papers. SSD matches by IoU: first compute the IoU between every ground truth and every anchor, then for each anchor take the ground truth with the largest IoU (so each anchor predicts at most one ground truth); if that maximum IoU exceeds a threshold (0.5 in SSD), the anchor is matched to this ground truth and is responsible for predicting it during training. Anchors matched to a ground truth are usually called positives, while anchors matched to nothing are negatives and should be classified as background. A ground truth may be matched by several anchors, but it may also be matched by none, when all its IoUs with the anchors fall below the threshold; to prevent this, each ground truth is always matched to the anchor with which it has the largest IoU, regardless of the threshold, so every ground truth gets at least one anchor. SSD's matching differs slightly from Faster RCNN's: Faster RCNN uses two thresholds (0.7, 0.3), and anchors between them are neither positive nor negative and contribute no loss. In the implementation, however, SSD's matcher can inherit and reuse the Faster RCNN logic, as shown below:
# The matching strategy of Faster RCNN and RetinaNet:
# 1. compute the IoU between all gt boxes and all anchors
# 2. for each anchor, pick the gt with the largest IoU
# 3. if that IoU is above the high threshold, match the anchor to this gt; if it is below the low
#    threshold the anchor is a negative; anchors in between are ignored
# 4. if allow_low_quality_matches is set, then for each gt the anchor with the largest IoU is forced to be
#    a positive, but that anchor is assigned to the gt with which *it* has the largest IoU (not necessarily
#    the original gt). This looks a bit odd, but it is also reasonable.
class Matcher:
    """
    This class assigns to each predicted "element" (e.g., a box) a ground-truth
    element. Each predicted element will have exactly zero or one matches; each
    ground-truth element may be assigned to zero or more predicted elements.
    Matching is based on the MxN match_quality_matrix, that characterizes how well
    each (ground-truth, predicted)-pair match. For example, if the elements are
    boxes, the matrix may contain box IoU overlap values.
    The matcher returns a tensor of size N containing the index of the ground-truth
    element m that matches to prediction n. If there is no match, a negative value
    is returned.
    """

    BELOW_LOW_THRESHOLD = -1
    BETWEEN_THRESHOLDS = -2

    __annotations__ = {
        "BELOW_LOW_THRESHOLD": int,
        "BETWEEN_THRESHOLDS": int,
    }

    def __init__(self, high_threshold: float, low_threshold: float, allow_low_quality_matches: bool = False) -> None:
        """
        Args:
            high_threshold (float): quality values greater than or equal to
                this value are candidate matches.
            low_threshold (float): a lower quality threshold used to stratify
                matches into three levels:
                1) matches >= high_threshold
                2) BETWEEN_THRESHOLDS matches in [low_threshold, high_threshold)
                3) BELOW_LOW_THRESHOLD matches in [0, low_threshold)
            allow_low_quality_matches (bool): if True, produce additional matches
                for predictions that have only low-quality match candidates. See
                set_low_quality_matches_ for more details.
        """
        self.BELOW_LOW_THRESHOLD = -1
        self.BETWEEN_THRESHOLDS = -2
        assert low_threshold <= high_threshold
        self.high_threshold = high_threshold
        self.low_threshold = low_threshold
        self.allow_low_quality_matches = allow_low_quality_matches

    def __call__(self, match_quality_matrix: Tensor) -> Tensor:
        """
        Args:
            match_quality_matrix (Tensor[float]): an MxN tensor, containing the
            pairwise quality between M ground-truth elements and N predicted elements.
        Returns:
            matches (Tensor[int64]): an N tensor where N[i] is a matched gt in
            [0, M - 1] or a negative value indicating that prediction i could not
            be matched.
        """
        if match_quality_matrix.numel() == 0:
            # empty targets or proposals not supported during training
            if match_quality_matrix.shape[0] == 0:
                raise ValueError("No ground-truth boxes available for one of the images during training")
            else:
                raise ValueError("No proposal boxes available for one of the images during training")

        # here match_quality_matrix holds the IoU between gt boxes and anchors
        # match_quality_matrix is M (gt) x N (predicted)
        # Max over gt elements (dim 0) to find best gt candidate for each prediction
        # for each anchor, find the gt with the largest IoU
        matched_vals, matches = match_quality_matrix.max(dim=0)
        if self.allow_low_quality_matches:
            all_matches = matches.clone()
        else:
            all_matches = None  # type: ignore[assignment]

        # decide whether each anchor is a positive, a negative, or ignored, based on its IoU
        # Assign candidate matches with low quality to negative (unassigned) values
        below_low_threshold = matched_vals < self.low_threshold
        between_thresholds = (matched_vals >= self.low_threshold) & (matched_vals < self.high_threshold)
        matches[below_low_threshold] = self.BELOW_LOW_THRESHOLD
        matches[between_thresholds] = self.BETWEEN_THRESHOLDS

        if self.allow_low_quality_matches:
            assert all_matches is not None
            self.set_low_quality_matches_(matches, all_matches, match_quality_matrix)
        return matches

    def set_low_quality_matches_(self, matches: Tensor, all_matches: Tensor, match_quality_matrix: Tensor) -> None:
        """
        Produce additional matches for predictions that have only low-quality matches.
        Specifically, for each ground-truth find the set of predictions that have
        maximum overlap with it (including ties); for each prediction in that set, if
        it is unmatched, then match it to the ground-truth with which it has the highest
        quality value.
        """
        # For each gt, find the prediction with which it has highest quality
        highest_quality_foreach_gt, _ = match_quality_matrix.max(dim=1)
        # Find highest quality match available, even if it is low, including ties
        gt_pred_pairs_of_highest_quality = torch.where(match_quality_matrix == highest_quality_foreach_gt[:, None])
        # Example gt_pred_pairs_of_highest_quality:
        #   tensor([[    0, 39796],
        #           [    1, 32055],
        #           [    1, 32070],
        #           [    2, 39190],
        #           [    2, 40255],
        #           [    3, 40390],
        #           [    3, 41455],
        #           [    4, 45470],
        #           [    5, 45325],
        #           [    5, 46390]])
        # Each row is a (gt index, prediction index)
        # Note how gt items 1, 2, 3, and 5 each have two ties
        pred_inds_to_update = gt_pred_pairs_of_highest_quality[1]
        matches[pred_inds_to_update] = all_matches[pred_inds_to_update]


class SSDMatcher(Matcher):
    def __init__(self, threshold: float) -> None:
        # the high and low thresholds are set to the same value, so there are only positives and negatives
        super().__init__(threshold, threshold, allow_low_quality_matches=False)

    def __call__(self, match_quality_matrix: Tensor) -> Tensor:
        # for each anchor, find the gt with the largest IoU
        matches = super().__call__(match_quality_matrix)

        # For each gt, find the prediction with which it has the highest quality
        # i.e. for each gt, find the anchor with the largest IoU
        _, highest_quality_pred_foreach_gt = match_quality_matrix.max(dim=1)
        # force these anchors to match their gt, regardless of the IoU value
        matches[highest_quality_pred_foreach_gt] = torch.arange(
            highest_quality_pred_foreach_gt.size(0), dtype=torch.int64, device=highest_quality_pred_foreach_gt.device
        )
        return matches
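A toy example (my addition, assuming the SSDMatcher above) makes the behaviour concrete: with 2 ground truths and 5 anchors, anchor 4 has IoU below 0.5 against both gts, yet it is forced to match gt 1 because it is gt 1's best anchor.
import torch

matcher = SSDMatcher(threshold=0.5)

# IoU matrix: 2 ground truths (rows) x 5 anchors (columns)
iou = torch.tensor([[0.7, 0.1, 0.6, 0.0, 0.1],
                    [0.1, 0.3, 0.2, 0.0, 0.4]])

print(matcher(iou))
# tensor([ 0, -1,  0, -1,  1])
# anchors 0 and 2 are positives for gt 0; anchors 1 and 3 are negatives;
# anchor 4 is forced to match gt 1 even though the IoU is only 0.4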
IoU-based matching is a hand-crafted rule: its upside is control over the quality of positives, its downside is a loss of flexibility. Some recent work proposes dynamic matching strategies that assign labels based on the model's predictions rather than on fixed rules, e.g. DETR and YOLOX.
Loss Function
Once the matching strategy has determined which anchors are positives and negatives, and which ground truth each positive should predict, the training loss is straightforward to compute. SSD's loss has two parts: a classification loss and a regression loss. Classification uses a softmax over the object classes plus one background class; a positive anchor's target is the class of its matched ground truth, while a negative anchor's target is background. Positives additionally incur a regression loss, computed with Smooth L1. A well-known problem for one-stage detectors is the severe imbalance between positives and negatives during training; SSD handles it with hard negative mining (RetinaNet uses focal loss instead): negatives are sampled by keeping the ones with the largest classification loss, so that the ratio of positives to negatives stays close to 1:3. The implementation is shown below:
    def compute_loss(
        self,
        targets: List[Dict[str, Tensor]],
        head_outputs: Dict[str, Tensor],
        anchors: List[Tensor],
        matched_idxs: List[Tensor],
    ) -> Dict[str, Tensor]:
        bbox_regression = head_outputs["bbox_regression"]
        cls_logits = head_outputs["cls_logits"]

        # Match original targets with default boxes
        num_foreground = 0
        bbox_loss = []
        cls_targets = []
        for (
            targets_per_image,
            bbox_regression_per_image,
            cls_logits_per_image,
            anchors_per_image,
            matched_idxs_per_image,
        ) in zip(targets, bbox_regression, cls_logits, anchors, matched_idxs):
            # determine the positive anchors
            foreground_idxs_per_image = torch.where(matched_idxs_per_image >= 0)[0]
            foreground_matched_idxs_per_image = matched_idxs_per_image[foreground_idxs_per_image]
            num_foreground += foreground_matched_idxs_per_image.numel()

            # compute the regression loss
            matched_gt_boxes_per_image = targets_per_image["boxes"][foreground_matched_idxs_per_image]
            bbox_regression_per_image = bbox_regression_per_image[foreground_idxs_per_image, :]
            anchors_per_image = anchors_per_image[foreground_idxs_per_image, :]
            target_regression = self.box_coder.encode_single(matched_gt_boxes_per_image, anchors_per_image)
            bbox_loss.append(
                torch.nn.functional.smooth_l1_loss(bbox_regression_per_image, target_regression, reduction="sum")
            )

            # Estimate ground truth for class targets
            gt_classes_target = torch.zeros(
                (cls_logits_per_image.size(0),),
                dtype=targets_per_image["labels"].dtype,
                device=targets_per_image["labels"].device,
            )
            gt_classes_target[foreground_idxs_per_image] = targets_per_image["labels"][
                foreground_matched_idxs_per_image
            ]
            cls_targets.append(gt_classes_target)

        bbox_loss = torch.stack(bbox_loss)
        cls_targets = torch.stack(cls_targets)

        # compute the classification loss
        num_classes = cls_logits.size(-1)
        cls_loss = F.cross_entropy(cls_logits.view(-1, num_classes), cls_targets.view(-1), reduction="none").view(
            cls_targets.size()
        )

        # Hard Negative Sampling, done per image rather than over the whole batch
        foreground_idxs = cls_targets > 0
        num_negative = self.neg_to_pos_ratio * foreground_idxs.sum(1, keepdim=True)  # number of negatives to sample
        negative_loss = cls_loss.clone()
        negative_loss[foreground_idxs] = -float("inf")  # use -inf to detect positive values that creeped in the sample
        # sort the negatives by loss in descending order
        values, idx = negative_loss.sort(1, descending=True)
        # keep the top-k negatives: idx.sort(1)[1] is the rank of each anchor's loss
        background_idxs = idx.sort(1)[1] < num_negative

        N = max(1, num_foreground)  # number of positives
        return {
            "bbox_regression": bbox_loss.sum() / N,
            "classification": (cls_loss[foreground_idxs].sum() + cls_loss[background_idxs].sum()) / N,
        }
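The double sort used for hard negative mining is easy to miss: negative_loss.sort(1, descending=True) returns, for every rank, the index of the anchor at that rank, and sorting those indices again turns them back into per-anchor ranks, so rank < num_negative keeps exactly the top-k hardest negatives. A tiny sketch of the trick (my addition):
import torch

loss = torch.tensor([[0.2, 0.9, 0.1, 0.5]])   # per-anchor classification loss for one image
_, idx = loss.sort(1, descending=True)        # idx  = [[1, 3, 0, 2]]: anchor index at each rank
rank = idx.sort(1)[1]                         # rank = [[2, 0, 3, 1]]: rank of each anchor
print(rank < 2)                               # tensor([[False,  True, False,  True]])
# keeps anchors 1 and 3, i.e. the two anchors with the largest loss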
Data Augmentation
SSD uses fairly strong data augmentation, including horizontal flip, color distortion, random crop and zoom out (which shrinks objects).
transforms = T.Compose(
    [
        T.RandomPhotometricDistort(),          # color distortion
        T.RandomZoomOut(fill=list(mean)),      # random zoom out
        T.RandomIoUCrop(),                     # random crop
        T.RandomHorizontalFlip(p=hflip_prob),  # horizontal flip
        T.PILToTensor(),
        T.ConvertImageDtype(torch.float),
    ]
)
The first two augmentations are easy to understand and implement, so let's focus on the latter two. SSD's random crop samples a region from the image, with the region's scale (relative to the original image) and aspect ratio constrained to given ranges; a ground-truth box is kept only if its center falls inside the cropped region, and an IoU threshold is sampled as well: the crop is only accepted if at least one box has an IoU with the crop region above that threshold. This augmentation is fairly involved, and the code is the easiest way to understand it:
# from the detection reference transforms (references/detection/transforms.py),
# where F is torchvision.transforms.functional
class RandomIoUCrop(nn.Module):
    def __init__(
        self,
        min_scale: float = 0.3,
        max_scale: float = 1.0,
        min_aspect_ratio: float = 0.5,
        max_aspect_ratio: float = 2.0,
        sampler_options: Optional[List[float]] = None,
        trials: int = 40,
    ):
        super().__init__()
        # Configuration similar to https://github.com/weiliu89/caffe/blob/ssd/examples/ssd/ssd_coco.py#L89-L174
        self.min_scale = min_scale
        self.max_scale = max_scale
        self.min_aspect_ratio = min_aspect_ratio
        self.max_aspect_ratio = max_aspect_ratio
        if sampler_options is None:
            sampler_options = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
        self.options = sampler_options
        self.trials = trials

    def forward(
        self, image: Tensor, target: Optional[Dict[str, Tensor]] = None
    ) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
        if target is None:
            raise ValueError("The targets can't be None for this transform.")
        if isinstance(image, torch.Tensor):
            if image.ndimension() not in {2, 3}:
                raise ValueError(f"image should be 2/3 dimensional. Got {image.ndimension()} dimensions.")
            elif image.ndimension() == 2:
                image = image.unsqueeze(0)

        orig_w, orig_h = F.get_image_size(image)
        while True:
            # sample an option: randomly pick an IoU threshold
            idx = int(torch.randint(low=0, high=len(self.options), size=(1,)))
            min_jaccard_overlap = self.options[idx]
            if min_jaccard_overlap >= 1.0:  # a value larger than 1 encodes the leave as-is option
                return image, target

            # the crop is constrained, so make several attempts
            for _ in range(self.trials):
                # check the aspect ratio limitations
                # randomly sample the scales of w and h
                r = self.min_scale + (self.max_scale - self.min_scale) * torch.rand(2)
                new_w = int(orig_w * r[0])
                new_h = int(orig_h * r[1])
                aspect_ratio = new_w / new_h
                # check whether the aspect ratio is within the allowed range, otherwise retry
                if not (self.min_aspect_ratio <= aspect_ratio <= self.max_aspect_ratio):
                    continue

                # check for 0 area crops
                # randomly pick the top-left corner of the crop region and reject degenerate crops
                r = torch.rand(2)
                left = int((orig_w - new_w) * r[0])
                top = int((orig_h - new_h) * r[1])
                right = left + new_w
                bottom = top + new_h
                if left == right or top == bottom:
                    continue

                # check for any valid boxes with centers within the crop area
                # a gt box is kept only if its center falls inside the crop region
                cx = 0.5 * (target["boxes"][:, 0] + target["boxes"][:, 2])
                cy = 0.5 * (target["boxes"][:, 1] + target["boxes"][:, 3])
                is_within_crop_area = (left < cx) & (cx < right) & (top < cy) & (cy < bottom)
                if not is_within_crop_area.any():
                    continue

                # check at least 1 box with jaccard limitations
                # the crop is accepted only if some box has IoU with the crop region above the threshold
                boxes = target["boxes"][is_within_crop_area]
                ious = torchvision.ops.boxes.box_iou(
                    boxes, torch.tensor([[left, top, right, bottom]], dtype=boxes.dtype, device=boxes.device)
                )
                if ious.max() < min_jaccard_overlap:
                    continue

                # keep only valid boxes and perform cropping
                # keep the valid boxes, shift them into the crop coordinate frame and clip
                target["boxes"] = boxes
                target["labels"] = target["labels"][is_within_crop_area]
                target["boxes"][:, 0::2] -= left
                target["boxes"][:, 1::2] -= top
                target["boxes"][:, 0::2].clamp_(min=0, max=new_w)
                target["boxes"][:, 1::2].clamp_(min=0, max=new_h)
                image = F.crop(image, top, left, new_h, new_w)
                return image, target
TensorFlow has a similar operation, see tf.image.sample_distorted_bounding_box. Random cropping effectively zooms in on objects.
Zoom out does the opposite: it shrinks objects, which effectively creates many training samples containing small objects. It is implemented by creating a larger canvas and placing the image at a random position inside it:
class RandomZoomOut(nn.Module):
    def __init__(
        self, fill: Optional[List[float]] = None, side_range: Tuple[float, float] = (1.0, 4.0), p: float = 0.5
    ):
        super().__init__()
        if fill is None:
            fill = [0.0, 0.0, 0.0]
        self.fill = fill
        self.side_range = side_range
        if side_range[0] < 1.0 or side_range[0] > side_range[1]:
            raise ValueError(f"Invalid canvas side range provided {side_range}.")
        self.p = p

    @torch.jit.unused
    def _get_fill_value(self, is_pil):
        # type: (bool) -> int
        # We fake the type to make it work on JIT
        return tuple(int(x) for x in self.fill) if is_pil else 0

    def forward(
        self, image: Tensor, target: Optional[Dict[str, Tensor]] = None
    ) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
        if isinstance(image, torch.Tensor):
            if image.ndimension() not in {2, 3}:
                raise ValueError(f"image should be 2/3 dimensional. Got {image.ndimension()} dimensions.")
            elif image.ndimension() == 2:
                image = image.unsqueeze(0)
        if torch.rand(1) < self.p:
            return image, target

        orig_w, orig_h = F.get_image_size(image)

        # randomly choose the canvas size
        r = self.side_range[0] + torch.rand(1) * (self.side_range[1] - self.side_range[0])
        canvas_width = int(orig_w * r)
        canvas_height = int(orig_h * r)

        # randomly choose the top-left position of the image inside the canvas
        r = torch.rand(2)
        left = int((canvas_width - orig_w) * r[0])
        top = int((canvas_height - orig_h) * r[1])
        right = canvas_width - (left + orig_w)
        bottom = canvas_height - (top + orig_h)
        if torch.jit.is_scripting():
            fill = 0
        else:
            fill = self._get_fill_value(F._is_pil_image(image))

        # pad the image to the canvas size
        image = F.pad(image, [left, top, right, bottom], fill=fill)

        # shift the boxes accordingly
        if target is not None:
            target["boxes"][:, 0::2] += left
            target["boxes"][:, 1::2] += top
        return image, target
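A rough usage sketch (my addition, assuming the two transform classes above are importable together with their torchvision.transforms.functional dependency, and using a dummy PIL image): zoom out first, then crop, as in the SSD training pipeline.
import torch
from PIL import Image

image = Image.new("RGB", (300, 300), color=(128, 128, 128))
target = {
    "boxes": torch.tensor([[50., 60., 150., 200.], [10., 10., 40., 40.]]),
    "labels": torch.tensor([1, 2]),
}

zoom = RandomZoomOut(fill=[123., 117., 104.])   # SSD fills the canvas with the dataset mean
crop = RandomIoUCrop()
image, target = zoom(image, target)   # place the image on a larger canvas and shift the boxes
image, target = crop(image, target)   # crop a region, keep boxes whose centers fall inside
print(image.size, target["boxes"])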
These augmentations greatly enlarge the effective training set and generate objects at many different scales, which is key to SSD's detection performance.
Training
The torchvision team also published the training hyper-parameters used to reproduce SSD:
python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
    --dataset coco --model ssd300_vgg16 --epochs 120\
    --lr-steps 80 110 --aspect-ratio-group-factor 3 --lr 0.002 --batch-size 4\
    --weight-decay 0.0005 --data-augmentation ssd
Because of the strong data augmentation, SSD needs a long training schedule: 120 epochs (by comparison, detectors such as Faster RCNN and RetinaNet are typically trained for 12 or 36 epochs). It is also instructive to compare these hyper-parameters with mmdet's training settings. In addition, the torchvision team documented how each optimization contributed to the accuracy of their SSD reproduction (see Everything You Need To Know About Torchvision's SSD Implementation); in that comparison, "weight init" and "input scaling" refer to using the Caffe-version VGG16 weights and its input normalization.
Inference
SSD's inference is straightforward: first filter out low-confidence predictions using the classification scores and a score threshold; then keep the top-K predictions per class; finally remove duplicate boxes with NMS. The full implementation:
    def postprocess_detections(
        self, head_outputs: Dict[str, Tensor], image_anchors: List[Tensor], image_shapes: List[Tuple[int, int]]
    ) -> List[Dict[str, Tensor]]:
        bbox_regression = head_outputs["bbox_regression"]
        pred_scores = F.softmax(head_outputs["cls_logits"], dim=-1)

        num_classes = pred_scores.size(-1)
        device = pred_scores.device

        detections: List[Dict[str, Tensor]] = []
        for boxes, scores, anchors, image_shape in zip(bbox_regression, pred_scores, image_anchors, image_shapes):
            boxes = self.box_coder.decode_single(boxes, anchors)  # decode the predicted boxes
            boxes = box_ops.clip_boxes_to_image(boxes, image_shape)  # clip boxes to the image

            image_boxes = []
            image_scores = []
            image_labels = []
            # for each class: filter out low-confidence predictions and keep the top-K
            for label in range(1, num_classes):
                score = scores[:, label]

                keep_idxs = score > self.score_thresh
                score = score[keep_idxs]
                box = boxes[keep_idxs]

                # keep only topk scoring predictions
                num_topk = min(self.topk_candidates, score.size(0))
                score, idxs = score.topk(num_topk)
                box = box[idxs]

                image_boxes.append(box)
                image_scores.append(score)
                image_labels.append(torch.full_like(score, fill_value=label, dtype=torch.int64, device=device))

            image_boxes = torch.cat(image_boxes, dim=0)
            image_scores = torch.cat(image_scores, dim=0)
            image_labels = torch.cat(image_labels, dim=0)

            # non-maximum suppression: remove duplicate boxes
            keep = box_ops.batched_nms(image_boxes, image_scores, image_labels, self.nms_thresh)
            keep = keep[: self.detections_per_img]

            detections.append(
                {
                    "boxes": image_boxes[keep],
                    "scores": image_scores[keep],
                    "labels": image_labels[keep],
                }
            )
        return detections
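In practice you rarely call postprocess_detections directly; the whole pipeline is wrapped in torchvision's ready-made model. A quick end-to-end sketch (my addition, assuming torchvision >= 0.10):
import torch
import torchvision

model = torchvision.models.detection.ssd300_vgg16(pretrained=True)
model.eval()

images = [torch.rand(3, 300, 300)]      # list of CHW tensors in [0, 1]
with torch.no_grad():
    detections = model(images)          # postprocess_detections runs inside

print(detections[0]["boxes"].shape, detections[0]["labels"][:5], detections[0]["scores"][:5])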
SSDLite
SSDLite is a lightweight SSD designed by Google in the MobileNetV2 paper. Compared with the original SSD, SSDLite swaps the VGG16 feature extractor for MobileNetV2 (or the newer MobileNetV3), and the extra prediction branches and detection heads use depthwise separable convolutions (depthwise 3x3 conv + 1x1 conv), which greatly reduces the number of parameters and the computation. torchvision now also implements SSDLite and reproduces the MobileNetV3-Large variant (21.3 mAP, essentially matching the 22.0 reported in the paper). Here are the implementation details. As with VGG16, the MobileNet feature extractor first provides two feature maps from the backbone body: a 1/16-resolution and a 1/32-resolution feature. The 1/32 feature is the output of the last conv layer before global average pooling, while the 1/16 feature is taken from just before the last stride-2 block. Here a block means MobileNetV2's inverted residual block, which consists of 1x1 conv + depthwise 3x3 conv + 1x1 conv; the first 1x1 conv is usually called the expansion layer and the last 1x1 conv the projection layer, and for a stride-2 block the stride sits on the middle depthwise 3x3 conv. The deepest 1/16 feature is therefore the output of the expansion layer of the last stride-2 block. Four extra prediction branches are then appended, so that, like SSD, six feature maps at different scales are used for detection; the extra branches also use depthwise separable convolutions (actually 1x1 conv + depthwise 3x3 conv + 1x1 conv). The implementation is as follows:
# each extra prediction branch is 1x1 conv + depthwise 3x3 s2 conv + 1x1 conv
def _extra_block(in_channels: int, out_channels: int, norm_layer: Callable[..., nn.Module]) -> nn.Sequential:
    activation = nn.ReLU6
    intermediate_channels = out_channels // 2
    return nn.Sequential(
        # 1x1 projection to half output channels
        ConvNormActivation(
            in_channels, intermediate_channels, kernel_size=1, norm_layer=norm_layer, activation_layer=activation
        ),
        # 3x3 depthwise with stride 2 and padding 1
        ConvNormActivation(
            intermediate_channels,
            intermediate_channels,
            kernel_size=3,
            stride=2,
            groups=intermediate_channels,
            norm_layer=norm_layer,
            activation_layer=activation,
        ),
        # 1x1 projection to output channels
        ConvNormActivation(
            intermediate_channels, out_channels, kernel_size=1, norm_layer=norm_layer, activation_layer=activation
        ),
    )


class SSDLiteFeatureExtractorMobileNet(nn.Module):
    def __init__(
        self,
        backbone: nn.Module,
        c4_pos: int,
        norm_layer: Callable[..., nn.Module],
        width_mult: float = 1.0,
        min_depth: int = 16,
    ):
        super().__init__()
        _log_api_usage_once(self)

        assert not backbone[c4_pos].use_res_connect
        self.features = nn.Sequential(
            # As described in section 6.3 of MobileNetV3 paper
            nn.Sequential(*backbone[:c4_pos], backbone[c4_pos].block[0]),  # layers before the last s=2 block, plus its expansion layer
            nn.Sequential(backbone[c4_pos].block[1:], *backbone[c4_pos + 1 :]),  # the remaining layers up to the conv before pooling
        )

        # the 4 extra prediction branches
        get_depth = lambda d: max(min_depth, int(d * width_mult))  # noqa: E731
        extra = nn.ModuleList(
            [
                _extra_block(backbone[-1].out_channels, get_depth(512), norm_layer),
                _extra_block(get_depth(512), get_depth(256), norm_layer),
                _extra_block(get_depth(256), get_depth(256), norm_layer),
                _extra_block(get_depth(256), get_depth(128), norm_layer),
            ]
        )
        _normal_init(extra)
        self.extra = extra

    def forward(self, x: Tensor) -> Dict[str, Tensor]:
        # Get feature maps from backbone and extra. Can't be refactored due to JIT limitations.
        output = []
        for block in self.features:
            x = block(x)
            output.append(x)

        for block in self.extra:
            x = block(x)
            output.append(x)

        return OrderedDict([(str(i), v) for i, v in enumerate(output)])
The detection heads also use depthwise separable convolutions:
# Building blocks of SSDlite as described in section 6.2 of MobileNetV2 paper
def _prediction_block(
    in_channels: int, out_channels: int, kernel_size: int, norm_layer: Callable[..., nn.Module]
) -> nn.Sequential:
    return nn.Sequential(
        # 3x3 depthwise with stride 1 and padding 1
        ConvNormActivation(
            in_channels,
            in_channels,
            kernel_size=kernel_size,
            groups=in_channels,
            norm_layer=norm_layer,
            activation_layer=nn.ReLU6,
        ),
        # 1x1 projection to output channels
        nn.Conv2d(in_channels, out_channels, 1),
    )


class SSDLiteClassificationHead(SSDScoringHead):
    def __init__(
        self, in_channels: List[int], num_anchors: List[int], num_classes: int, norm_layer: Callable[..., nn.Module]
    ):
        cls_logits = nn.ModuleList()
        for channels, anchors in zip(in_channels, num_anchors):
            cls_logits.append(_prediction_block(channels, num_classes * anchors, 3, norm_layer))
        _normal_init(cls_logits)
        super().__init__(cls_logits, num_classes)


class SSDLiteRegressionHead(SSDScoringHead):
    def __init__(self, in_channels: List[int], num_anchors: List[int], norm_layer: Callable[..., nn.Module]):
        bbox_reg = nn.ModuleList()
        for channels, anchors in zip(in_channels, num_anchors):
            bbox_reg.append(_prediction_block(channels, 4 * anchors, 3, norm_layer))
        _normal_init(bbox_reg)
        super().__init__(bbox_reg, 4)
One extra detail about the feature extractor: in MobileNetV3, Google further halves the channel width of the modules between C4 and C5, i.e. the modules after the last stride-2 block; torchvision exposes this behavior via a boolean reduce_tail argument. SSDLite takes a 320x320 input, and its default boxes also differ slightly from SSD's: all 6 feature maps use 6 default boxes per location (with scales produced by the linear rule), configured as follows:
# 2 default boxes + 4 boxes with aspect ratios [2, 1/2, 3, 1/3] per location
anchor_generator = DefaultBoxGenerator([[2, 3] for _ in range(6)], min_ratio=0.2, max_ratio=0.95)
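With this configuration the linear rule gives scales of roughly 0.2, 0.35, 0.5, 0.65, 0.8 and 0.95 for the six feature maps, and 6 default boxes per location. A quick check (my addition, assuming the generator above):
print(anchor_generator.scales)
# approximately [0.2, 0.35, 0.5, 0.65, 0.8, 0.95, 1.0]; the trailing 1.0 only serves as s_{k+1} for the last map
print(anchor_generator.num_anchors_per_location())
# [6, 6, 6, 6, 6, 6]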
Because SSDLite has far fewer parameters than SSD, the data augmentation is simplified to avoid underfitting during training (color jitter and zoom out are dropped; lightweight augmentation is generally recommended for small models, and YOLOX does the same):
transforms = T.Compose(
    [
        T.RandomIoUCrop(),
        T.RandomHorizontalFlip(p=hflip_prob),
        T.PILToTensor(),
        T.ConvertImageDtype(torch.float),
    ]
)
The torchvision team also documented their SSDLite reproduction in the blog post Everything You Need To Know About Torchvision's SSDlite Implementation, including how each optimization affected the final accuracy.
SSDLite's training recipe also differs slightly from SSD's, with a much longer schedule (660 epochs):
torchrun --nproc_per_node=8 train.py\
    --dataset coco --model ssdlite320_mobilenet_v3_large --epochs 660\
    --aspect-ratio-group-factor 3 --lr-scheduler cosineannealinglr --lr 0.15 --batch-size 24\
    --weight-decay 0.00004 --data-augmentation ssdlite
One final note: in Google's TensorFlow Object Detection implementation of SSDLite, the classification branch uses RetinaNet's focal loss instead of the original SSD hard negative mining + softmax loss; if you look carefully at that model zoo, you will also see that SSD ResNet50 FPN and RetinaNet50 there are the same thing under two names.
Finally, here is a fairly clean and complete SSD implementation: https://github.com/xiaohu2015/ssd_pytorch.
References
SSD: Single Shot MultiBox Detector
SSD Slide
Everything You Need To Know About Torchvision's SSD Implementation
Everything You Need To Know About Torchvision's SSDlite Implementation
MobileNetV2: Inverted Residuals and Linear Bottlenecks
Searching for MobileNetV3