
Implementing a Capsule Network in TensorFlow


          2020-11-05 18:10


We all know that convolutional neural networks (CNNs) outperform humans on many computer vision tasks. All CNN-based models share the same basic architecture: convolutional layers followed by pooling layers, with intermediate batch normalization layers that normalize each batch in the forward pass and keep the gradients under control in the backward pass.

However, the max pooling layers used in CNNs have a drawback: they do not consider the relationship between the pixel holding the maximum value and its immediate neighbors. To address this, Hinton proposed the idea of capsule networks together with an algorithm called "dynamic routing between capsules". In this post, we walk through the implementation details of that model.


TensorFlow Operations


Building a model in TensorFlow 2.3 with the Functional API or a Sequential model is very easy and takes only a few lines of code. In this capsule network implementation, however, we use the Functional API together with some custom operations decorated with @tf.function for optimization. This section focuses on how tf.matmul behaves for higher-dimensional tensors.

          tf.matmul

For 2D matrices, the matmul operation performs ordinary matrix multiplication as long as the shape conventions are respected. For tensors of rank r > 2, however, the operation becomes a combination of two operations: element-wise broadcasting and matrix multiplication.
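
For instance, a quick 2D sanity check (this small example is mine, not from the original post; TensorFlow is assumed to be imported as tf) looks like this:

>>> a = tf.reshape(tf.range(6), (2, 3))
>>> b = tf.reshape(tf.range(12), (3, 4))
>>> tf.matmul(a, b).shape
TensorShape([2, 4])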

For a rank-4 tensor, matmul first broadcasts along axes [0, 1] so that the shapes match, and the last two axes ([2, 3]) are then matrix-multiplied, which only works when the last dimension of the first tensor matches the second-to-last dimension of the second tensor. The example below illustrates this; for brevity only the shapes are printed, but feel free to print and inspect the actual numbers in a console.

>>> w = tf.reshape(tf.range(48), (1,8,3,2))
>>> x = tf.reshape(tf.range(40), (5,1,2,4))
>>> tf.matmul(w, x).shape
TensorShape([5, 8, 3, 4])

Here w is broadcast along axis 0 and x is broadcast along axis 1, and the remaining two dimensions are matrix-multiplied. Now let's look at the transpose_a / transpose_b arguments of matmul. When tf.transpose is called on a tensor, all of its dimensions are reversed. For example,

>>> a = tf.reshape(tf.range(48), (1,8,3,2))
>>> tf.transpose(a).shape
TensorShape([2, 3, 8, 1])

So, let's see how tf.matmul behaves with transpose_a:

>>> w = tf.ones((1,10,16,1))
>>> x = tf.ones((1152,1,16,1))
>>> tf.matmul(w, x, transpose_a=True).shape
TensorShape([1152, 10, 1, 1])

What TensorFlow does is first broadcast along the leading two dimensions and then treat the tensors as stacks of 2D matrices. You can visualize it as a transpose applied only to the last two dimensions of the first array. After that transpose, the first array has shape [1152, 10, 16, 1] -> [1152, 10, 1, 16], and the matrix multiplication is then applied. In other words, transpose_a=True applies this transpose to the first argument of matmul.
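
As a quick check of this claim (shapes only, not from the original post), explicitly transposing just the last two axes of the first argument gives the same result shape:

>>> w = tf.ones((1, 10, 16, 1))
>>> x = tf.ones((1152, 1, 16, 1))
>>> tf.matmul(tf.transpose(w, perm=[0, 1, 3, 2]), x).shape
TensorShape([1152, 10, 1, 1])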


The Capsule Layers


Let's see what happens in the code.

convolution = tf.keras.layers.Conv2D(256, [9,9], strides=[1,1], name='ConvolutionLayer', activation='relu')
primary_capsule = tf.keras.layers.Conv2D(32 * 8, [9,9], strides=[2,2], name="PrimaryCapsule")

x = convolution(input_x)    # x.shape: (None, 20, 20, 256)
x = primary_capsule(x)      # x.shape: (None, 6, 6, 256)

We have created the primary capsule output using the tf.keras Functional API. These layers simply perform convolution operations on the input image input_x during the forward pass. At this point we have 256 (32 * 8) feature maps, each of size 6 x 6.

Now, instead of viewing the feature maps above as convolution outputs, we re-imagine them as 32 volumes of 6 x 6 x 8 stacked along the last axis. So simply by reshaping we get 6 * 6 * 32 = 1152 8D vectors. Each of these vectors is multiplied by a weight matrix that encapsulates the relationship between these lower-level features and the higher-level features. The output features of the Primary Capsule layer are 8D, while the input features of the Digit Caps layer are 16D, so essentially we have to multiply each vector by a 16 x 8 matrix. And since there are 1152 vectors coming from the primary capsules, we will have 1152 such 16 x 8 matrices.

In the next layer we have 10 digit capsules, so we will have 10 such sets of 1152 16 x 8 matrices, which gives a weight tensor of shape [1152, 10, 16, 8]. Each of the 1152 8D vectors from the primary capsule output contributes to every one of the 10 digit capsules, so we can simply reuse the same 8D vector for each capsule in the digit capsule layer. More simply, we add 2 new axes to the 1152 8D vectors, turning them into shape [1152, 1, 8, 1].

w = tf.Variable(tf.random_normal_initializer()(shape=[1, 1152, 10, 16, 8]), dtype=tf.float32, name="Pose_Estimation")

u = tf.reshape(x, (-1, 1152, 8))
u = tf.expand_dims(u, axis=-2)  # u.shape: (None, 1152, 1, 8)
u = tf.expand_dims(u, axis=-1)  # u.shape: (None, 1152, 1, 8, 1)
# In the matrix multiplication: (1, 1152, 10, 16, 8) x (None, 1152, 1, 8, 1) -> (None, 1152, 10, 16, 1)
u_hat = tf.matmul(w, u)  # u_hat.shape: (None, 1152, 10, 16, 1)


Note: the variable w has an extra dimension of size 1 along the first axis, so that the same weights are broadcast across the whole batch.

In u_hat the last dimension is redundant; it was only added to make the matrix multiplication shapes work out, so it can now be removed with the squeeze function. The batch_size in the shapes above is determined at training time.

          u_hat = tf.squeeze(u_hat, [4]) # u_hat.shape: (None, 1152, 10, 16)


Dynamic Routing


Before exploring the algorithm, let's first build the squash function and keep it for later use. A small epsilon is added to the denominator in case the norm is zero, to prevent the gradients from exploding.
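
For reference, the squash non-linearity from the paper is

$$\text{squash}(s) = \frac{\|s\|^2}{1 + \|s\|^2}\,\frac{s}{\|s\|},$$

which shrinks short vectors toward zero and squashes long vectors to a length just below 1; the implementation below adds epsilon to the denominator.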


epsilon = 1e-7

def squash(s):
    s_norm = tf.norm(s, axis=-1, keepdims=True)
    return tf.square(s_norm) / (1 + tf.square(s_norm)) * s / (s_norm + epsilon)

In this step, the inputs to the digit capsules are the 16D vectors (u_hat), and the number of routing iterations is r = 3.

The dynamic routing algorithm does not need much tweaking; the code is almost a direct implementation of the algorithm from the paper. Have a look at the snippet below.

b = tf.zeros((input_x.shape[0], 1152, 10, 1))
for i in range(r):
    c = tf.nn.softmax(b, axis=-2)  # c.shape: (None, 1152, 10, 1)
    s = tf.reduce_sum(tf.multiply(c, u_hat), axis=1, keepdims=True)  # s.shape: (None, 1, 10, 16)
    v = squash(s)  # v.shape: (None, 1, 10, 16)
    agreement = tf.squeeze(tf.matmul(tf.expand_dims(u_hat, axis=-1), tf.expand_dims(v, axis=-1), transpose_a=True), [4])  # agreement.shape: (None, 1152, 10, 1)
    # Before the matmul, the following intermediate shapes are present; they are not assigned to a variable but listed just for understanding the code.
    # u_hat.shape (intermediate): (None, 1152, 10, 16, 1)
    # v.shape (intermediate): (None, 1, 10, 16, 1)
    # Since the first parameter of matmul is transposed, its shape becomes (None, 1152, 10, 1, 16)
    # matmul is then performed on the last two dimensions, and the others are broadcast
    # Before squeezing we have an intermediate shape of (None, 1152, 10, 1, 1)
    b += agreement

A few things to note here:

1. c represents the probability distribution of the u_hat values over the digit capsules for a given capsule in the primary capsule layer, and it sums to 1 across the digit capsules. Simply put, the values of u_hat are distributed across the digit capsules according to c, which is trained by the routing algorithm.

2. The sum Σᵢ c_ij û_j|i is the weighted sum over all lower-level vectors that feed into a digit capsule. Since there are 1152 lower-level vectors, reduce_sum is applied over that dimension; setting keepdims=True simply makes the subsequent computations easier.

3. The squash non-linearity is applied to the 16D vectors of the digit capsules to normalize their values.

4. The next step is a clever one: the dot product between the input and the output of the digit capsule layer is computed. This dot product determines the "agreement" between the lower-level and higher-level capsules (the routing updates are summarized right after this list).
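
Putting these steps together, each routing iteration in the loop above corresponds to the following updates from the paper:

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_{k}\exp(b_{ik})}, \qquad s_j = \sum_i c_{ij}\,\hat{u}_{j|i}, \qquad v_j = \text{squash}(s_j), \qquad b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i}\cdot v_j$$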

The loop above is repeated 3 times, and the resulting value of v is then passed on to the reconstruction network.


Reconstruction Network


The reconstruction network is a regularizer that regenerates the image from the features of the Digit Capsule layer. During backpropagation it affects the whole network, so the features become suitable both for prediction and for regeneration. During training, the model uses the actual label of the input image to mask the digit caps values to zero, except for the capsule corresponding to the label (as shown in the figure below).


The tensor coming out of the network above has shape (None, 1, 10, 16). We broadcast the label along the 16D vectors of the Digit Caps layer and apply the mask.

Note: the one-hot encoded labels are used for masking.

y = tf.expand_dims(y, axis=-1)       # y.shape: (None, 10, 1)
y = tf.expand_dims(y, axis=1)        # y.shape: (None, 1, 10, 1)
mask = tf.cast(y, dtype=tf.float32)  # mask.shape: (None, 1, 10, 1)
v_masked = tf.multiply(mask, v)      # v_masked.shape: (None, 1, 10, 16)

This v_masked is then sent to the reconstruction network and used to regenerate the whole image. The reconstruction network is just the 3 dense layers shown in the gist below.


dense_1 = tf.keras.layers.Dense(units=512, activation='relu')
dense_2 = tf.keras.layers.Dense(units=1024, activation='relu')
dense_3 = tf.keras.layers.Dense(units=784, activation='sigmoid', dtype='float32')

v_ = tf.reshape(v_masked, [-1, 10 * 16])             # v_.shape: (None, 160)
reconstructed_image = dense_1(v_)                    # reconstructed_image.shape: (None, 512)
reconstructed_image = dense_2(reconstructed_image)   # reconstructed_image.shape: (None, 1024)
reconstructed_image = dense_3(reconstructed_image)   # reconstructed_image.shape: (None, 784)

We now wrap the same code above into a CapsuleNetwork class that inherits from tf.keras.Model. This class can then be used directly in a custom training loop and for making predictions.

class CapsuleNetwork(tf.keras.Model):
    def __init__(self, no_of_conv_kernels, no_of_primary_capsules, primary_capsule_vector, no_of_secondary_capsules, secondary_capsule_vector, r):
        super(CapsuleNetwork, self).__init__()
        self.no_of_conv_kernels = no_of_conv_kernels
        self.no_of_primary_capsules = no_of_primary_capsules
        self.primary_capsule_vector = primary_capsule_vector
        self.no_of_secondary_capsules = no_of_secondary_capsules
        self.secondary_capsule_vector = secondary_capsule_vector
        self.r = r

        with tf.name_scope("Variables") as scope:
            self.convolution = tf.keras.layers.Conv2D(self.no_of_conv_kernels, [9,9], strides=[1,1], name='ConvolutionLayer', activation='relu')
            self.primary_capsule = tf.keras.layers.Conv2D(self.no_of_primary_capsules * self.primary_capsule_vector, [9,9], strides=[2,2], name="PrimaryCapsule")
            self.w = tf.Variable(tf.random_normal_initializer()(shape=[1, 1152, self.no_of_secondary_capsules, self.secondary_capsule_vector, self.primary_capsule_vector]), dtype=tf.float32, name="PoseEstimation", trainable=True)
            self.dense_1 = tf.keras.layers.Dense(units=512, activation='relu')
            self.dense_2 = tf.keras.layers.Dense(units=1024, activation='relu')
            self.dense_3 = tf.keras.layers.Dense(units=784, activation='sigmoid', dtype='float32')

    def build(self, input_shape):
        pass

    def squash(self, s):
        with tf.name_scope("SquashFunction") as scope:
            s_norm = tf.norm(s, axis=-1, keepdims=True)
            return tf.square(s_norm)/(1 + tf.square(s_norm)) * s/(s_norm + epsilon)

    @tf.function
    def call(self, inputs):
        input_x, y = inputs
        # input_x.shape: (None, 28, 28, 1)
        # y.shape: (None, 10)

        x = self.convolution(input_x)  # x.shape: (None, 20, 20, 256)
        x = self.primary_capsule(x)    # x.shape: (None, 6, 6, 256)

        with tf.name_scope("CapsuleFormation") as scope:
            u = tf.reshape(x, (-1, self.no_of_primary_capsules * x.shape[1] * x.shape[2], 8))  # u.shape: (None, 1152, 8)
            u = tf.expand_dims(u, axis=-2)  # u.shape: (None, 1152, 1, 8)
            u = tf.expand_dims(u, axis=-1)  # u.shape: (None, 1152, 1, 8, 1)
            u_hat = tf.matmul(self.w, u)    # u_hat.shape: (None, 1152, 10, 16, 1)
            u_hat = tf.squeeze(u_hat, [4])  # u_hat.shape: (None, 1152, 10, 16)

        with tf.name_scope("DynamicRouting") as scope:
            b = tf.zeros((input_x.shape[0], 1152, self.no_of_secondary_capsules, 1))  # b.shape: (None, 1152, 10, 1)
            for i in range(self.r):  # self.r = 3
                c = tf.nn.softmax(b, axis=-2)  # c.shape: (None, 1152, 10, 1)
                s = tf.reduce_sum(tf.multiply(c, u_hat), axis=1, keepdims=True)  # s.shape: (None, 1, 10, 16)
                v = self.squash(s)  # v.shape: (None, 1, 10, 16)
                agreement = tf.squeeze(tf.matmul(tf.expand_dims(u_hat, axis=-1), tf.expand_dims(v, axis=-1), transpose_a=True), [4])  # agreement.shape: (None, 1152, 10, 1)
                # Before the matmul, the following intermediate shapes are present; they are not assigned to a variable but listed just for understanding the code.
                # u_hat.shape (intermediate): (None, 1152, 10, 16, 1)
                # v.shape (intermediate): (None, 1, 10, 16, 1)
                # Since the first parameter of matmul is transposed, its shape becomes (None, 1152, 10, 1, 16)
                # matmul is then performed on the last two dimensions, and the others are broadcast
                # Before squeezing we have an intermediate shape of (None, 1152, 10, 1, 1)
                b += agreement

        with tf.name_scope("Masking") as scope:
            y = tf.expand_dims(y, axis=-1)       # y.shape: (None, 10, 1)
            y = tf.expand_dims(y, axis=1)        # y.shape: (None, 1, 10, 1)
            mask = tf.cast(y, dtype=tf.float32)  # mask.shape: (None, 1, 10, 1)
            v_masked = tf.multiply(mask, v)      # v_masked.shape: (None, 1, 10, 16)

        with tf.name_scope("Reconstruction") as scope:
            v_ = tf.reshape(v_masked, [-1, self.no_of_secondary_capsules * self.secondary_capsule_vector])  # v_.shape: (None, 160)
            reconstructed_image = self.dense_1(v_)                   # reconstructed_image.shape: (None, 512)
            reconstructed_image = self.dense_2(reconstructed_image)  # reconstructed_image.shape: (None, 1024)
            reconstructed_image = self.dense_3(reconstructed_image)  # reconstructed_image.shape: (None, 784)

        return v, reconstructed_image

    @tf.function
    def predict_capsule_output(self, inputs):
        x = self.convolution(inputs)  # x.shape: (None, 20, 20, 256)
        x = self.primary_capsule(x)   # x.shape: (None, 6, 6, 256)

        with tf.name_scope("CapsuleFormation") as scope:
            u = tf.reshape(x, (-1, self.no_of_primary_capsules * x.shape[1] * x.shape[2], 8))  # u.shape: (None, 1152, 8)
            u = tf.expand_dims(u, axis=-2)  # u.shape: (None, 1152, 1, 8)
            u = tf.expand_dims(u, axis=-1)  # u.shape: (None, 1152, 1, 8, 1)
            u_hat = tf.matmul(self.w, u)    # u_hat.shape: (None, 1152, 10, 16, 1)
            u_hat = tf.squeeze(u_hat, [4])  # u_hat.shape: (None, 1152, 10, 16)

        with tf.name_scope("DynamicRouting") as scope:
            b = tf.zeros((inputs.shape[0], 1152, self.no_of_secondary_capsules, 1))  # b.shape: (None, 1152, 10, 1)
            for i in range(self.r):  # self.r = 3
                c = tf.nn.softmax(b, axis=-2)  # c.shape: (None, 1152, 10, 1)
                s = tf.reduce_sum(tf.multiply(c, u_hat), axis=1, keepdims=True)  # s.shape: (None, 1, 10, 16)
                v = self.squash(s)  # v.shape: (None, 1, 10, 16)
                agreement = tf.squeeze(tf.matmul(tf.expand_dims(u_hat, axis=-1), tf.expand_dims(v, axis=-1), transpose_a=True), [4])  # agreement.shape: (None, 1152, 10, 1)
                # The same intermediate shapes as in call() apply here.
                b += agreement
        return v

    @tf.function
    def regenerate_image(self, inputs):
        with tf.name_scope("Reconstruction") as scope:
            v_ = tf.reshape(inputs, [-1, self.no_of_secondary_capsules * self.secondary_capsule_vector])  # v_.shape: (None, 160)
            reconstructed_image = self.dense_1(v_)                   # reconstructed_image.shape: (None, 512)
            reconstructed_image = self.dense_2(reconstructed_image)  # reconstructed_image.shape: (None, 1024)
            reconstructed_image = self.dense_3(reconstructed_image)  # reconstructed_image.shape: (None, 784)
        return reconstructed_image

Two additional functions, predict_capsule_output() and regenerate_image(), are added here; they predict the digit caps vectors and regenerate the image, respectively. The first helps predict digits at test time, and the second helps regenerate an image from a given set of input features (it will be used in the visualization).

epsilon = 1e-7
m_plus = 0.9
m_minus = 0.1
lambda_ = 0.5
alpha = 0.0005

epochs = 100

params = {
    "no_of_conv_kernels": 256,
    "no_of_primary_capsules": 32,
    "no_of_secondary_capsules": 10,
    "primary_capsule_vector": 8,
    "secondary_capsule_vector": 16,
    "r": 3,
}

model = CapsuleNetwork(**params)

So the last remaining piece is the loss function. The paper uses a margin loss for classification and a squared-difference reconstruction loss weighted by 0.0005. The parameters m+, m- and λ are defined in the gist above, and the loss function is shown in the gist below.
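
For reference, the margin loss for digit capsule k as defined in the paper is

$$L_k = T_k\,\max(0,\,m^{+} - \|v_k\|)^2 + \lambda\,(1 - T_k)\,\max(0,\,\|v_k\| - m^{-})^2,$$

where T_k = 1 if digit k is present, m+ = 0.9, m- = 0.1 and λ = 0.5. The total loss is the sum of L_k over all digit capsules plus α = 0.0005 times the reconstruction loss.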


def loss_function(v, reconstructed_image, y, y_image):
    prediction = safe_norm(v)
    prediction = tf.reshape(prediction, [-1, no_of_secondary_capsules])

    left_margin = tf.square(tf.maximum(0.0, m_plus - prediction))
    right_margin = tf.square(tf.maximum(0.0, prediction - m_minus))

    l = tf.add(y * left_margin, lambda_ * (1.0 - y) * right_margin)

    margin_loss = tf.reduce_mean(tf.reduce_sum(l, axis=-1))

    y_image_flat = tf.reshape(y_image, [-1, 784])
    reconstruction_loss = tf.reduce_mean(tf.square(y_image_flat - reconstructed_image))

    loss = tf.add(margin_loss, alpha * reconstruction_loss)

    return loss


Here v is the unmasked digit capsule vector, y is the one-hot encoded vector of labels, and y_image is the actual image fed into the model. The safe norm function is just a function similar to TensorFlow's norm function, but it includes an epsilon to prevent the value from becoming exactly 0.

def safe_norm(v, axis=-1):
    v_ = tf.reduce_sum(tf.square(v), axis=axis, keepdims=True)
    return tf.sqrt(v_ + epsilon)

Let's check the model's summary.
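
Since CapsuleNetwork is a subclassed model, its layers are only built after one forward pass, so summary() needs a dummy call first. A minimal sketch (the zero tensors below are just MNIST-shaped placeholders, not part of the original post):

_ = model([tf.zeros((1, 28, 28, 1)), tf.zeros((1, 10))])  # build the variables with one dummy batch
model.summary()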


With this, the model architecture is complete. The model has 8,215,568 parameters, in line with the paper, which reports 8.2M parameters for the model with reconstruction; counted another way, the figure is 8,238,608 parameters. The difference is that TensorFlow only counts tf.Variable resources among the trainable parameters. If we also treat the 1152 * 10 entries of b and the 1152 * 10 entries of c as trainable, we arrive at the same number:

8215568 + 11520 + 11520 = 8238608
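
As a sanity check, the 8,215,568 trainable parameters can be reproduced layer by layer from the shapes used above (28 x 28 x 1 MNIST inputs assumed):

conv1   = 9 * 9 * 1 * 256 + 256              # ConvolutionLayer: 20,992
primary = 9 * 9 * 256 * (32 * 8) + 32 * 8    # PrimaryCapsule:   5,308,672
w       = 1 * 1152 * 10 * 16 * 8             # PoseEstimation:   1,474,560
dense_1 = 160 * 512 + 512                    # 82,432
dense_2 = 512 * 1024 + 1024                  # 525,312
dense_3 = 1024 * 784 + 784                   # 803,600
print(conv1 + primary + w + dense_1 + dense_2 + dense_3)  # 8215568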


Other Details


We will use tf.GradientTape to compute the gradients, and the Adam optimizer.

optimizer = tf.keras.optimizers.Adam()  # Adam optimizer, as mentioned above

def train(x, y):
    y_one_hot = tf.one_hot(y, depth=10)
    with tf.GradientTape() as tape:
        v, reconstructed_image = model([x, y_one_hot])
        loss = loss_function(v, reconstructed_image, y_one_hot, x)
    grad = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grad, model.trainable_variables))
    return loss

Since we subclassed tf.keras.Model, we can simply call model.trainable_variables and apply the gradients.
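
For completeness, a minimal training-loop sketch (not part of the original post) might look like this, assuming X_train and y_train hold the float32 MNIST training images of shape (60000, 28, 28, 1) and their integer labels; the batch size of 64 is an arbitrary choice:

dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train)).shuffle(60000).batch(64)

for epoch in range(epochs):
    total_loss = 0.0
    for batch, (x, y) in enumerate(dataset):
        total_loss += train(x, y)
    print("Epoch", epoch + 1, "average loss:", float(total_loss) / (batch + 1))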


def predict(model, x):
    pred = safe_norm(model.predict_capsule_output(x))  # pred.shape: (None, 1, 10, 1)
    pred = tf.squeeze(pred, [1])                       # pred.shape: (None, 10, 1)
    return np.argmax(pred, axis=1)[:, 0]

A custom prediction function is defined here, which takes the model as well as the input images as arguments. The reason for passing the model as an argument is that a checkpointed model can later be used for prediction.
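
As a usage example (again a sketch, not from the original post), the reported test accuracy can be estimated with the helper above, assuming X_test and y_test hold the float32 MNIST test images of shape (10000, 28, 28, 1) and their integer labels:

correct = 0
for i in range(0, len(X_test), 64):  # evaluate in batches of 64
    batch = tf.convert_to_tensor(X_test[i:i + 64])
    correct += np.sum(predict(model, batch) == y_test[i:i + 64])
print("Test accuracy:", correct / len(X_test))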


Results and Feature Visualization


The model reaches 99% training accuracy and 98% test accuracy. At some checkpoints, however, the test accuracy was 98.4%, and at some others 97.7%.

In the gist below, index_ refers to a particular sample number in the test set, and index refers to the actual digit that the sample represents, i.e. y_test[index_].


print(predict(model, tf.expand_dims(X_test[index_], axis=0)), y_test[index_])

features = model.predict_capsule_output(tf.expand_dims(X_test[index_], axis=0))

temp_features = features.numpy()
temp_ = temp_features.copy()
temp_features[:, :, :, :] = 0
temp_features[:, :, index, :] = temp_[:, :, index, :]

recon = model.regenerate_image(temp_features)
recon = tf.reshape(recon, (28, 28))

plt.subplot(1, 2, 1)
plt.imshow(recon, cmap='gray')
plt.subplot(1, 2, 2)
plt.imshow(X_test[index_, :, :, 0], cmap='gray')

The code below tweaks each feature over the range [-0.25, 0.25] in increments of 0.05. At each step an image is generated and stored in an array, so we can see how each feature contributes to the reconstruction.

col = np.zeros((28, 308))
for i in range(16):
    feature_ = temp_features.copy()
    feature_[:, :, index, i] += -0.25
    row = np.zeros((28, 28))
    for j in range(10):
        feature_[:, :, index, i] += 0.05
        row = np.hstack([row, tf.reshape(model.regenerate_image(tf.convert_to_tensor(feature_)), (28, 28)).numpy()])
    col = np.vstack([col, row])

plt.figure(figsize=(30, 20))
plt.imshow(col[28:, 28:], cmap='gray')

Some reconstruction examples are shown in the figure below. We can see that certain features control brightness, rotation angle, stroke thickness, skew and so on.

Conclusion

In this post, we tried to reproduce the results and visualize the features described in the paper. A training accuracy of 99% and a test accuracy of almost 98% is really good. The model does take a long time to train, but the learned features are very intuitive.

