Source: AIWalker

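Recently I (Happy) have been experimenting with INT8 quantization for image super-resolution, and found that PyTorch quantization has many pitfalls and is far less convenient than TensorFlow's. Still, after some effort I managed to quantize an image super-resolution model with PyTorch. Taking EDSR as the example, the model size shrinks by about 73% and inference is roughly 40% faster (on PC), with almost no visible loss in quality; quantitative metrics are still to be added. There are plenty of blog posts about quantization online, but few are really helpful, so I will try to provide a complete, reproducible quantization example based on image super-resolution.

In the previous article, I gave a conceptual introduction to PyTorch's Post Training Static Quantization (PTSQ). Here we walk through it using the EDSR image super-resolution network as the example.

Preparation

Before quantizing, we need the model that is going to be quantized. This article uses the EDSR-baseline model, so you can download the official pretrained weights directly. The official PyTorch implementation of EDSR is here: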

github.com/thstkdgus35/EDSR-PyTorch

The EDSRx4-baseline pretrained model can be downloaded from:

https://cv.snu.ac.kr/research/EDSR/models/edsr_baseline_x4-6b446fab.pt

Besides the pretrained model and the code, we also need calibration data. I use DIV2K, which can be downloaded from:

https://cv.snu.ac.kr/research/EDSR/DIV2K.tar

Model conversion

As described in the previous article, ops need to be fused before quantization. The official EDSR implementation is not convenient for fusion, so I restructured it as follows (note: some parameters are hard-coded here):

import torch.nn as nn


class ResBlock(nn.Module):
    """Residual block: conv -> ReLU -> conv, plus the identity shortcut."""

    def __init__(self, channels=64):
        super(ResBlock, self).__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, 1, 1)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channels, channels, 3, 1, 1)

    def forward(self, x):
        identity = x
        conv1 = self.conv1(x)
        relu = self.relu(conv1)
        conv2 = self.conv2(relu)

        output = conv2 + identity
        return output


class EDSR(nn.Module):
    """EDSR-baseline x4, without the sub_mean / add_mean layers."""

    def __init__(self,
                 num_blocks=16,
                 num_features=64,
                 block=ResBlock):
        super(EDSR, self).__init__()
        self.head = nn.Conv2d(3, num_features, 3, 1, 1)
        body = [
            block(num_features) for _ in range(num_blocks)
        ]
        body.append(nn.Conv2d(num_features, num_features, 3, 1, 1))
        self.body = nn.Sequential(*body)
        # Two x2 PixelShuffle stages give the overall x4 upscaling.
        self.tail = nn.Sequential(
            nn.Conv2d(num_features, num_features * 4, 3, 1, 1),
            nn.PixelShuffle(upscale_factor=2),
            nn.Conv2d(num_features, num_features * 4, 3, 1, 1),
            nn.PixelShuffle(upscale_factor=2),
            nn.Conv2d(num_features, 3, 3, 1, 1)
        )

    def forward(self, x, **kwargs):
        x = self.head(x)
        res = self.body(x)
        res += x
        x = self.tail(res)
        return x
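To see why the tail produces a x4 output, it can help to trace tensor shapes by hand. The sketch below is a pure-Python shape calculator, not part of the model; the helper names (`conv_shape`, `pixel_shuffle_shape`, `edsr_tail_shape`) are made up for illustration, and it assumes 3x3 convolutions with stride 1 and padding 1 as in the code above.

```python
def conv_shape(shape, out_channels):
    """3x3 conv with stride 1 and padding 1: only the channel count changes."""
    _, h, w = shape
    return (out_channels, h, w)


def pixel_shuffle_shape(shape, r):
    """PixelShuffle moves r*r channel groups into an r-times larger grid."""
    c, h, w = shape
    assert c % (r * r) == 0, "channels must be divisible by r*r"
    return (c // (r * r), h * r, w * r)


def edsr_tail_shape(shape, num_features=64):
    """Follow a (C, H, W) shape through the five layers of the tail."""
    shape = conv_shape(shape, num_features * 4)
    shape = pixel_shuffle_shape(shape, 2)   # back to num_features channels, x2 size
    shape = conv_shape(shape, num_features * 4)
    shape = pixel_shuffle_shape(shape, 2)   # x2 again, so x4 overall
    return conv_shape(shape, 3)


out_shape = edsr_tail_shape((64, 48, 48))
print(out_shape)
```

Each PixelShuffle divides the channel count by 4, which is why the second convolution in the tail takes `num_features` input channels again.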

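You may ask whether the original pretrained weights can still be loaded after this restructuring. Not directly, because the checkpoint keys have changed, so the downloaded checkpoint needs a simple conversion. The conversion code is below (note: the mappings are hard-coded and have been verified):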

from collections import OrderedDict

import torch

checkpoint = torch.load("edsr_baseline_x4-6b446fab.pt", map_location='cpu')
newStateDict = OrderedDict()

for key, val in checkpoint.items():
    if 'head' in key:
        # head.0.* -> head.*
        newStateDict[key.replace('.0.', '.')] = val
    elif 'mean' in key:
        # sub_mean / add_mean are handled outside the network
        continue
    elif 'tail' in key:
        if '.0.0.' in key:
            newStateDict[key.replace('.0.0.', '.0.')] = val
        elif '.0.2.' in key:
            newStateDict[key.replace('.0.2.', '.2.')] = val
        else:
            newStateDict[key.replace('.1.', '.4.')] = val
    elif 'body' in key:
        if '.body.0.' in key:
            newStateDict[key.replace(".body.0.", '.conv1.')] = val
        elif '.body.2.' in key:
            newStateDict[key.replace(".body.2.", '.conv2.')] = val
        elif "16" in key:
            # the final conv of the body keeps its key
            newStateDict[key] = val
# Saved under the file name that the Init step loads later.
torch.save(newStateDict, "edsrx4-baseline-fp32.pth.tar")
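The renaming rules can be checked without loading the real checkpoint. The sketch below wraps the same rules in a hypothetical helper, `convert_key`, and applies it to a few sample key names. The sample keys are assumptions about the official checkpoint layout (`head.0.*`, `body.N.body.0.*`, `tail.0.0.*`, and so on); they are not read from the file.

```python
def convert_key(key):
    """Map an original EDSR checkpoint key to the restructured model's key.

    Returns None for keys that are dropped (sub_mean / add_mean) or unmatched.
    """
    if 'head' in key:
        return key.replace('.0.', '.')
    if 'mean' in key:
        return None
    if 'tail' in key:
        if '.0.0.' in key:
            return key.replace('.0.0.', '.0.')
        if '.0.2.' in key:
            return key.replace('.0.2.', '.2.')
        return key.replace('.1.', '.4.')
    if 'body' in key:
        if '.body.0.' in key:
            return key.replace('.body.0.', '.conv1.')
        if '.body.2.' in key:
            return key.replace('.body.2.', '.conv2.')
        if '16' in key:
            return key
    return None


# Illustrative key names only (assumed layout of the official checkpoint).
samples = [
    'sub_mean.weight',
    'head.0.weight',
    'body.0.body.0.weight',
    'body.3.body.2.bias',
    'body.16.weight',
    'tail.0.0.weight',
    'tail.0.2.bias',
    'tail.1.weight',
]
converted = {k: convert_key(k) for k in samples}
for old, new in converted.items():
    print(f"{old:24s} -> {new}")
```

Comparing the converted names against `model.state_dict().keys()` of the restructured EDSR is a quick way to catch a mapping mistake before calling `load_state_dict`.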

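Readers who compare this against the original code will notice that add_mean and sub_mean are gone from EDSR. Indeed, I moved add_mean and sub_mean outside the network so that they are not quantized; the reason is given in the notes at the end.

We also need a quantizable version of the EDSR implementation above. There is not much to explain, so here is the code. It differs in three ways: it inserts quantization nodes (QuantStub and DeQuantStub), it replaces the additions with FloatFunctional, and it adds a fuse_model method: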

from torch.nn.quantized import FloatFunctional
from torch.quantization import QuantStub, DeQuantStub, fuse_modules


class QuantizableResBlock(ResBlock):
    def __init__(self, *args, **kwargs):
        super(QuantizableResBlock, self).__init__(*args, **kwargs)
        # Quantized tensors cannot be added with "+", so use FloatFunctional.
        self.add = FloatFunctional()

    def forward(self, x):
        identity = x

        conv1 = self.conv1(x)
        relu = self.relu(conv1)
        conv2 = self.conv2(relu)

        output = self.add.add(identity, conv2)
        return output

    def fuse_model(self):
        # Fuse conv1 + relu into a single op.
        fuse_modules(self, ['conv1', 'relu'], inplace=True)


class QuantizableEDSR(EDSR):
    def __init__(self, *args, **kwargs):
        super(QuantizableEDSR, self).__init__(*args, **kwargs)

        self.quant = QuantStub()
        self.dequant = DeQuantStub()
        self.add = FloatFunctional()

    def forward(self, x):
        x = self.quant(x)      # float -> quantized
        x = self.head(x)
        res = self.body(x)
        res = self.add.add(res, x)
        x = self.tail(res)
        x = self.dequant(x)    # quantized -> float
        return x

    def fuse_model(self):
        for m in self.modules():
            if isinstance(m, QuantizableResBlock):
                m.fuse_model()

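Model quantization

The previous article also covered the steps of PTSQ (extended here with model construction and saving):

init: define the model, load the pretrained weights, and replace inplace ops with non-inplace ops;
config: define the quantization configuration. Here we use fbgemm, with a Histogram observer for activations and per_channel quantization for weights;
fuse: fuse ops in the model, for example adjacent Conv+ReLU or Conv+BN+ReLU;
prepare: prepare for quantization by inserting an Observer at each op to be quantized;
feed: feed calibration data so that the inserted Observers collect the statistics needed for quantization;
convert: convert non-quantized ops into quantized ops, for example nn.Conv2d into nnq.Conv2d, and compute the quantization parameters of nnq.Conv2d (scale, zero_point, qweight, etc.) from the statistics the Observers collected;
save: save the quantized model parameters.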
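To make the scale and zero_point mentioned in the convert step concrete, here is a minimal pure-Python sketch of 8-bit affine quantization derived from an observed min/max. It is a simplified illustration under my own assumptions (unsigned 0..255 range, min/max statistics, hypothetical function names), not PyTorch's actual implementation.

```python
def choose_qparams(x_min, x_max, qmin=0, qmax=255):
    """Derive (scale, zero_point) from an observed value range."""
    # The range must contain 0 so that real zero maps to an exact integer.
    x_min = min(x_min, 0.0)
    x_max = max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero range
    zero_point = int(round(qmin - x_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Real value -> integer code, clamped to the representable range."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Integer code -> approximate real value."""
    return (q - zero_point) * scale


scale, zero_point = choose_qparams(-1.0, 3.0)
values = [-1.0, 0.0, 0.5, 3.0]
roundtrip = [dequantize(quantize(v, scale, zero_point), scale, zero_point)
             for v in values]
max_error = max(abs(a - b) for a, b in zip(values, roundtrip))
print(scale, zero_point, max_error)
```

The round-trip error stays within about half a quantization step, so the wider the observed range, the coarser every value inside it becomes. That is why the choice of Observer matters.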

Init

Creating the model and loading the pretrained weights is straightforward, so here is the code (note: for PTSQ the model should be in eval mode).

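# _replace_relu is assumed here to be the helper from torchvision,
# which swaps inplace ReLU modules for non-inplace ones.
from torchvision.models.quantization.utils import _replace_relu

checkpoint = torch.load("edsrx4-baseline-fp32.pth.tar")
model = QuantizableEDSR(block=QuantizableResBlock)
model.load_state_dict(checkpoint)
_replace_relu(model)
model.eval()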

Config

This step specifies the quantization scheme that matches the inference engine: x86 platforms should use fbgemm, while ARM platforms should use qnnpack. This article targets PC, so fbgemm is used, with the following configuration:

from torch.quantization import (default_histogram_observer,
                                default_per_channel_weight_observer)

backend = 'fbgemm'
torch.backends.quantized.engine = backend
model.qconfig = torch.quantization.QConfig(
    activation=default_histogram_observer,
    weight=default_per_channel_weight_observer
)

Fuse&Prepare

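The Fuse and Prepare steps do two things:

Fuse ops, such as Conv+ReLU or Conv+BN+ReLU; see 'fuse_model' in the implementation above. PyTorch currently supports several fusion patterns. It is enough to know that they exist; the two steps together take two lines of code:

model.fuse_model()
torch.quantization.prepare(model, inplace=True)

Insert Observers: an Observer is inserted at each op to be quantized. Different quantization schemes use different Observers, which collect statistics on the calibration data fed in, such as the maximum, the minimum, and the histogram distribution.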
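As a rough mental model of what an Observer does during calibration, the toy class below records the running minimum and maximum of everything passed through it and returns the data unchanged. `ToyMinMaxObserver` is a made-up name for illustration; the real PyTorch observers work on tensors and also keep histograms or per-channel statistics.

```python
class ToyMinMaxObserver:
    """Records the running min and max of the values it sees."""

    def __init__(self):
        self.min_val = float('inf')
        self.max_val = float('-inf')
        self.count = 0

    def __call__(self, values):
        values = list(values)
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))
        self.count += len(values)
        # An observer only watches: the data is passed through unchanged.
        return values


observer = ToyMinMaxObserver()
calibration_batches = [[0.1, -0.4, 0.9], [2.5, 0.0], [-1.2, 0.3]]
for batch in calibration_batches:
    observer(batch)

print(observer.min_val, observer.max_val, observer.count)
```

The statistics only become quantization parameters at the convert step, which is why the calibration data should resemble what the model sees at inference time.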

Feed

This step feeds calibration data through the prepared model. It works just like ordinary model testing; reference code:

Note: I used 100 images here. Using the whole dataset also works, but takes longer.
import os

meanBGR = torch.FloatTensor((0.4488, 0.4371, 0.4040)).view(3, 1, 1) * 255
data_root = "${DIV2K_train_LR_bicubic/X4}"
for index in range(1, 101):  # images 0001 to 0100
    image_path = os.path.join(data_root, f"{index:04d}.png")
    # preprocess is the author's own image-loading helper (not shown here)
    inputs = preprocess(image_path)
    inputs -= meanBGR  # sub_mean, done outside the network

    with torch.no_grad():
        output = model(inputs)

Convert&Save

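After the previous steps, the floating-point model can be quantized with a single line of code. During conversion, floating-point ops such as nn.Conv2d are converted into their quantized counterparts, such as nnq.Conv2d.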

torch.quantization.convert(model, inplace=True)
torch.save(model.state_dict(), "edsrx4-baseline-qint8.pth.tar")

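With these steps, the INT8 quantization of EDSR is complete and the model is saved. This is only the first part of the work, because the evaluation that follows is critical: a quantized model with severe quality loss is not usable.

Testing the quantized model

Next, let's test the quantized model and look at the results. The code for loading it is below (it differs slightly from loading a regular model):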

def fp32edsr(block=ResBlock, pretrained=None):
    model = EDSR(block=block)
    if pretrained:
        state_dict = torch.load(pretrained, map_location="cpu")
        model.load_state_dict(state_dict)
    return model


def qint8edsr(block=QuantizableResBlock, pretrained=None, quantize=False):
    # pretrained is a checkpoint path (or None)
    model = QuantizableEDSR(block=block)
    _replace_relu(model)

    if quantize:
        # Convert the module structure first, so that the keys of the
        # quantized state_dict match when it is loaded below.
        backend = 'fbgemm'
        quantize_model(model, backend)

    if pretrained:
        state_dict = torch.load(pretrained, map_location="cpu")
        model.load_state_dict(state_dict)

    return model


def quantize_model(model, backend):
    if backend not in torch.backends.quantized.supported_engines:
        raise RuntimeError("Quantized backend not supported ")
    torch.backends.quantized.engine = backend
    model.eval()

    _dummy_input_data = torch.rand(1, 3, 64, 64)

    # Make sure that weight qconfig matches that of the serialized models
    if backend == 'fbgemm':
        model.qconfig = torch.quantization.QConfig(
            activation=torch.quantization.default_histogram_observer,
            weight=torch.quantization.default_per_channel_weight_observer)
    elif backend == 'qnnpack':
        model.qconfig = torch.quantization.QConfig(
            activation=torch.quantization.default_histogram_observer,
            weight=torch.quantization.default_weight_observer)

    model.fuse_model()
    torch.quantization.prepare(model, inplace=True)
    # One dummy forward pass so the observers have something to record.
    model(_dummy_input_data)
    torch.quantization.convert(model, inplace=True)

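As the code shows, compared with the fp32 model, the quantized model needs two extra steps:

replacing ops that use inplace=True with ops that use inplace=False;
building the simplest quantized version of the model (calibrated on a dummy input), which completes the initial op replacement.

With the code above, we can test directly on DIV2K. An excerpt of the test code: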

# model is the INT8 model and fmodel the FP32 model, built with the
# qint8edsr and fp32edsr functions shown earlier.
index = 1
image_path = os.path.join(data_root, f"{index:04d}.png")
inputs = preprocess(image_path)
inputs -= meanBGR

with torch.no_grad():
    output1 = model(inputs)
    output2 = fmodel(inputs)

# add_mean, done outside the network
output1 += meanBGR
output2 += meanBGR

show1 = post_process(output1)
cv2.imwrite(f"results/{index:03d}-int8.png", show1)
show2 = post_process(output2)
cv2.imwrite(f"results/{index:03d}-fp32.png", show2)

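The figure above compares the two models on image 0016 from the DIV2K training set: the FP32 result is on the left and the INT8 result is on the right. The quantized model is visually lossless, that is, the quality drop caused by quantization is imperceptible. The table below summarizes the comparison before and after quantization (tested on PC with DIV2K; speed is the average).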

	FP32	INT8	Compression/Speedup
Model size	5953K	1610K	73%
Speed	5.94s	3.39s	43%
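The percentages in the last column are relative reductions, which the following short calculation reproduces from the table's own numbers (the helper name `reduction` is just for illustration):

```python
def reduction(before, after):
    """Relative reduction, e.g. 0.25 means 'after' is 25% smaller."""
    return 1.0 - after / before


size_reduction = reduction(5953, 1610)   # model size in K
time_reduction = reduction(5.94, 3.39)   # average inference time in seconds
speedup_factor = 5.94 / 3.39             # the same result as a ratio

print(f"size reduction: {size_reduction:.0%}")
print(f"time reduction: {time_reduction:.0%}")
print(f"speedup factor: {speedup_factor:.2f}x")
```

So the 43% figure means the INT8 model needs 43% less time per image, which corresponds to running about 1.75 times as fast.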

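Notes

Why move add_mean and sub_mean outside the network and exclude them from quantization?

In our comparisons, moving them outside gave better results. This may be related to the parameters of add_mean and sub_mean: both are simple mean-shift operations, and quantizing them can introduce large deviations in the weight values, which then hurts the accuracy of the subsequent quantization.

How should the quantization scheme be chosen?

For activations, the supported observers include HistogramObserver and MinMaxObserver; for weights, PerChannelMinMaxObserver and MinMaxObserver. In our comparisons, Histogram+PerChannelMinMax works better than MinMaxObserver+PerChannelMinMax. The figure below shows image 0018 from the DIV2K training set quantized with the second combination, where the quantization loss is clearly perceptible.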
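One plausible reason a min/max range can hurt is that a single rare outlier stretches the range and makes the quantization step coarse for every other value. The pure-Python sketch below illustrates this on synthetic data. Note that it uses simple percentile clipping as a stand-in for a range that ignores outliers; this is my own simplification, and PyTorch's HistogramObserver chooses its range with a different, search-based procedure.

```python
import statistics


def quantization_errors(values, lo, hi, levels=256):
    """Step size and per-value absolute error of a uniform quantizer on [lo, hi]."""
    step = (hi - lo) / (levels - 1)
    errors = []
    for v in values:
        clipped = min(max(v, lo), hi)
        q = round((clipped - lo) / step)
        errors.append(abs(v - (lo + q * step)))
    return step, errors


# Synthetic activations: 2001 values in [-1, 1] plus one rare outlier.
values = [i / 1000.0 for i in range(-1000, 1001)] + [40.0]

# Strategy 1: take the range straight from the min and max.
minmax_step, minmax_err = quantization_errors(values, min(values), max(values))

# Strategy 2 (stand-in): ignore the most extreme values at each end.
ordered = sorted(values)
cut = max(1, len(ordered) // 1000)
clip_step, clip_err = quantization_errors(values, ordered[cut], ordered[-cut - 1])

print("min/max step:", minmax_step, "median error:", statistics.median(minmax_err))
print("clipped step:", clip_step, "median error:", statistics.median(clip_err))
print("largest clipped error:", max(clip_err))
```

The clipped range gives a much finer step and a lower typical error, at the cost of a large error on the outlier itself. Which trade-off looks better in the output image has to be checked empirically, as in the comparison above.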