
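Sharing an older article: a detailed walkthrough of the principles and code behind text classification, well suited to NLP beginners.

Preface

This article is a detailed reproduction of the classic paper "Convolutional Neural Networks for Sentence Classification"^[1], (probably) based on TensorFlow 1.1 and Python 3.6. It covers data preprocessing, model construction, training and prediction, and visualization from end to end, and is meant as a reference for readers who are new to the field and unsure how to start building a network. Without further ado, let's get into it.

CNNs in NLP

The paper uses a CNN to classify sentences as positive or negative. You first need a rough understanding of CNNs; see my earlier post "[Deep learning] Convolutional Neural Network (CNN) Architecture". CNNs are currently used mainly in computer vision, and it is fair to say that their arrival brought a qualitative leap in CV research and applications.

Most NLP work at present uses RNN-family architectures such as RNN, GRU, and LSTM; add attention and you have what is more or less the standard NLP toolkit. For text classification, however, a CNN is simpler and faster to build and train than an RNN and still performs reasonably well, so it continues to attract some research.

So how exactly is a CNN applied to NLP?

Unlike CV, where the input is image pixels, the input in NLP is a sentence or a document. After passing through an embedding (word2vec or GloVe), the sentence or document is represented as a matrix in which each row is one word: the number of rows is the sentence length and the number of columns is the embedding dimension. For example, a ten-word sentence with a 100-dimensional embedding gives a 10x100 input matrix.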
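
The sentence-to-matrix step can be sketched in plain Python. This is only an illustration: the example sentence, the random embedding table, and the helper name embed_sentence are assumptions made here, not part of the original code, and a real model would use trained or pre-trained vectors.

```python
import random

def embed_sentence(tokens, vocab, table):
    """Return one embedding row per token (a len(tokens) x dim matrix)."""
    return [table[vocab[tok]] for tok in tokens]

random.seed(0)

# Hypothetical ten-word sentence, already tokenized
sentence = "this film is far better than i had ever expected".split()

# Map each distinct word to a row index of the embedding table
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence)))}

# Random 100-dimensional vectors stand in for word2vec/GloVe vectors
embed_dim = 100
table = [[random.uniform(-1.0, 1.0) for _ in range(embed_dim)] for _ in vocab]

matrix = embed_sentence(sentence, vocab, table)
print(len(matrix), len(matrix[0]))  # 10 100
```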

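In CV, filters slide over the whole image as patches (of arbitrary height and width). In NLP, a filter spans the full embedding dimension, so its shape is [filter_size, embed_size]. The figure makes this more concrete: the input is a 7x5 matrix, and the filters have heights 2, 3, and 4 and width 5, the same as the input. Each filter is convolved with the input matrix to produce an intermediate feature map, and max pooling then keeps the largest value of each map. With two filters per height, this yields a feature vector of 6 values.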
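
As a rough sketch of that 7x5 example, the snippet below runs full-width filters of heights 2, 3, and 4 (two of each) down a random input and max-pools each feature map. The random values and the helper name conv_max_pool are assumptions for illustration; bias and activation are left out to keep it short.

```python
import random

def conv_max_pool(matrix, filt):
    """Slide a full-width filter down the rows (no padding), then take the max."""
    height = len(filt)
    feature_map = []
    for start in range(len(matrix) - height + 1):
        window = matrix[start:start + height]
        feature_map.append(sum(w * x
                               for f_row, m_row in zip(filt, window)
                               for w, x in zip(f_row, m_row)))
    return max(feature_map), len(feature_map)

random.seed(0)
seq_len, embed_size = 7, 5
sentence = [[random.uniform(-1, 1) for _ in range(embed_size)]
            for _ in range(seq_len)]

features, map_lengths = [], []
for filter_size in (2, 3, 4):
    for _ in range(2):  # two filters per size
        filt = [[random.uniform(-1, 1) for _ in range(embed_size)]
                for _ in range(filter_size)]
        pooled, length = conv_max_pool(sentence, filt)
        features.append(pooled)
        map_lengths.append(length)

print(map_lengths)    # [6, 6, 5, 5, 4, 4]
print(len(features))  # 6
```

Each feature map has length 7 - filter_size + 1, and pooling reduces every map to a single number regardless of that length, which is why the final vector has one entry per filter.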

With the CNN structure clear, we can start implementing the text classification task.

Data preprocessing

The original paper uses several datasets; here we pick just one, Movie Review Data from Rotten Tomatoes^[2]. It contains 10662 reviews, half positive and half negative.

The data processing stage consists mainly of the following parts:

1. Load file

import numpy as np

def load_data_and_labels(positive_file, negative_file):
    # Load one review per line; "with" closes the files afterwards
    with open(positive_file, "r", encoding='utf-8') as f:
        positive_examples = [s.strip() for s in f]
    with open(negative_file, "r", encoding='utf-8') as f:
        negative_examples = [s.strip() for s in f]
    # Clean every sentence (clean_str is defined in step 2)
    x_text = positive_examples + negative_examples
    x_text = [clean_str(sent) for sent in x_text]
    # Generate one-hot labels: [0, 1] = positive, [1, 0] = negative
    positive_labels = [[0, 1] for _ in positive_examples]
    negative_labels = [[1, 0] for _ in negative_examples]
    y = np.concatenate([positive_labels, negative_labels], 0)
    return [x_text, y]

2. Clean sentences

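import re

def clean_str(string):
    # Replace everything except letters, digits and a few punctuation marks
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    # Split common contractions off as separate tokens
    string = re.sub(r"\'s", " 's", string)
    string = re.sub(r"\'ve", " 've", string)
    string = re.sub(r"n\'t", " n't", string)
    string = re.sub(r"\'re", " 're", string)
    string = re.sub(r"\'d", " 'd", string)
    string = re.sub(r"\'ll", " 'll", string)
    # Surround punctuation with spaces. The replacement uses a plain "(",
    # because "\(" in a replacement string leaves a literal backslash behind.
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " ( ", string)
    string = re.sub(r"\)", " ) ", string)
    string = re.sub(r"\?", " ? ", string)
    # Collapse runs of whitespace into a single space
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()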
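
The training code further down refers to vocab_processor and x_train.shape[1], which suggests that the cleaned sentences are also mapped to integer ids and padded to a common length, although that step is not shown here. A minimal sketch of the idea, assuming whitespace tokenization and a padding id of 0; build_vocab and to_padded_ids are hypothetical helpers, not the functions used by the original code:

```python
def build_vocab(sentences, pad_token="<PAD>"):
    """Assign an integer id to every distinct word; id 0 is reserved for padding."""
    vocab = {pad_token: 0}
    for sent in sentences:
        for word in sent.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_padded_ids(sentences, vocab, pad_token="<PAD>"):
    """Convert sentences to id lists, right-padded to the longest sentence."""
    max_len = max(len(s.split()) for s in sentences)
    rows = []
    for sent in sentences:
        ids = [vocab[w] for w in sent.split()]
        rows.append(ids + [vocab[pad_token]] * (max_len - len(ids)))
    return rows

# Two already-cleaned example sentences (made up for illustration)
cleaned = ["a gorgeous , witty film", "it 's a mess"]
vocab = build_vocab(cleaned)
x = to_padded_ids(cleaned, vocab)
print(x)  # [[1, 2, 3, 4, 5], [6, 7, 1, 8, 0]]
```

Every row now has the same length, which is what the fixed sequence_length of the input placeholder requires.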

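Model implementation

The model used in the paper is structured as follows. The first layer is an embedding layer that maps words to vector representations. Next comes a convolutional layer with multiple filters, here covering 3, 4, and 5 words at a time. A max-pooling layer then produces one long feature vector, and after dropout a softmax gives the probability of each class.

We implement this model in a single CNN class: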

class TextCNN(object):
    """
    A CNN class for sentence classification
    With an embedding layer + a convolutional, max-pooling and softmax layer
    """
    def __init__(self, sequence_length, num_classes, vocab_size,
                 embedding_size, filter_sizes, num_filters, l2_reg_lambda=0.0):
        """
        :param sequence_length: The length of our sentences
        :param num_classes:     Number of classes in the output layer (pos and neg)
        :param vocab_size:      The size of our vocabulary
        :param embedding_size:  The dimensionality of our embeddings
        :param filter_sizes:    The number of words each convolutional filter covers
        :param num_filters:     The number of filters per filter size
        :param l2_reg_lambda:   L2 regularization strength (optional)
        """

A further note on filter_sizes and num_filters: filter_sizes lists how many words each filter covers at a time, and num_filters is the number of filters used for each of those sizes.
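
To make the two parameters concrete, the sketch below counts the pooled features and the convolution parameters for one configuration. The values 128 for num_filters and embedding_size are assumptions chosen for illustration, and conv_param_count is a hypothetical helper.

```python
def conv_param_count(filter_sizes, num_filters, embedding_size):
    """Count weights plus biases across all convolution filters."""
    total = 0
    for size in filter_sizes:
        # One weight tensor of shape [size, embedding_size, 1, num_filters]
        weights = size * embedding_size * num_filters
        biases = num_filters
        total += weights + biases
    return total

filter_sizes = [3, 4, 5]
num_filters = 128     # assumed value
embedding_size = 128  # assumed value

# One pooled feature per filter, so the final feature vector has this length
print(num_filters * len(filter_sizes))                              # 384
print(conv_param_count(filter_sizes, num_filters, embedding_size))  # 196992
```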

1. Input placeholder

tf.placeholder creates a TensorFlow placeholder, which is used together with feed_dict. When training or testing the model, we supply the input values through feed_dict.

# set placeholders for variables
self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name='input_x')
self.input_y = tf.placeholder(tf.float32, [None, num_classes], name='input_y')
self.dropout_keep_prob = tf.placeholder(tf.float32, name='dropout_keep_prob')

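The first argument of tf.placeholder is the data type and the second is the shape, where None stands for the number of samples (the batch size, which may vary). The third argument, name, assigns a name to the node.

dropout_keep_prob is used in the dropout stage: during training we use 50% dropout (dropout_keep_prob = 0.5), and at test time dropout is disabled (dropout_keep_prob = 1.0).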

2. Embedding layer

The first layer we need to define is the embedding layer, which turns words into vector representations.

# embedding layer
with tf.name_scope('embedding'):
    self.W = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0), name='weight')
    self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x)
    # TensorFlow's conv2d operation expects a 4-dimensional tensor
    # with dimensions corresponding to batch, height, width and channel.
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

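W is the embedding matrix learned during training, and tf.nn.embedding_lookup looks up the vectors that correspond to input_x. The result of tf.nn.embedding_lookup is a 3-dimensional tensor of shape [None, sequence_length, embedding_size], but the convolutional layer that follows expects a 4-dimensional input (batch, height, width, channel). We therefore add one more dimension so that the result matches the input expected by the next layer.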
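
The effect of adding that trailing dimension can be checked on nested lists. This is a plain-Python stand-in for what tf.expand_dims(x, -1) does to the shape, with made-up sizes; shape and expand_last_dim are illustrative helpers only.

```python
def shape(nested):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return dims

def expand_last_dim(nested):
    """Wrap every scalar in a one-element list, adding a trailing axis of size 1."""
    if isinstance(nested, list):
        return [expand_last_dim(item) for item in nested]
    return [nested]

# A batch of 2 sentences, 6 words each, 4-dimensional embeddings (assumed sizes)
batch = [[[0.0] * 4 for _ in range(6)] for _ in range(2)]
expanded = expand_last_dim(batch)

print(shape(batch))     # [2, 6, 4]
print(shape(expanded))  # [2, 6, 4, 1]
```

The new axis plays the role of the channel dimension: text has a single channel, much like a grayscale image.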

3. Convolution and Max-Pooling Layers

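The most important part of the convolutional layer is the filter. Recall the first figure in this article: there are three filter sizes, with two filters of each size. We iterate over the filter sizes, apply each to the input matrix, and merge all the results into one large feature vector.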

# conv + max-pooling for each filter
pooled_outputs = []
for i, filter_size in enumerate(filter_sizes):
    with tf.name_scope('conv-maxpool-%s' % filter_size):
        # conv layer
        filter_shape = [filter_size, embedding_size, 1, num_filters]
        W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name='W')
        b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name='b')
        conv = tf.nn.conv2d(self.embedded_chars_expanded, W, strides=[1,1,1,1],
                            padding='VALID', name='conv')
        # activation
        h = tf.nn.relu(tf.nn.bias_add(conv, b), name='relu')
        # max pooling
        pooled = tf.nn.max_pool(h, ksize=[1, sequence_length-filter_size + 1, 1, 1],
                                strides=[1,1,1,1], padding='VALID', name='pool')
        pooled_outputs.append(pooled)


# combine all the pooled features
num_filters_total = num_filters * len(filter_sizes)
# each pooled tensor is [batch, 1, 1, num_filters]; axis 3 is the filter axis
self.h_pool = tf.concat(pooled_outputs, 3)
self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])

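Here W is the filter matrix, and tf.nn.conv2d is TensorFlow's convolution function. Its parameters include:

strides: how far the filter moves at each step. It is always a 4-element vector of the form [1, height_stride, width_stride, 1], whose first and last entries must be 1.
padding: one of two values, VALID or SAME.

VALID means the input matrix is not padded with zeros, so the output is smaller than the input;
SAME means the input matrix is padded with zeros so that, with a stride of 1, the output has the same size as the input.

We use 'VALID' here, so the convolution output has shape [batch_size, sequence_length - filter_size + 1, 1, num_filters].

Next comes a max-pooling layer. Pooling is easy to understand: it simply picks the largest value. Because the pooling window covers the whole feature map, the output of this layer has shape [batch_size, 1, 1, num_filters].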
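
The shape bookkeeping from the convolution through to the flattened feature vector can be traced with a few lines of arithmetic. The batch size, sequence length, and filter counts below are assumed values for illustration, and text_cnn_shapes is a hypothetical helper rather than part of the model code.

```python
def text_cnn_shapes(batch_size, sequence_length, filter_sizes, num_filters):
    """Trace tensor shapes through conv (VALID, stride 1), max-pool and concat."""
    shapes = {}
    for size in filter_sizes:
        conv_len = sequence_length - size + 1
        shapes[size] = {
            "conv": [batch_size, conv_len, 1, num_filters],
            "pooled": [batch_size, 1, 1, num_filters],
        }
    total = num_filters * len(filter_sizes)
    # Concatenating along axis 3 stacks the filters of all sizes side by side
    shapes["concat"] = [batch_size, 1, 1, total]
    shapes["flat"] = [batch_size, total]
    return shapes

shapes = text_cnn_shapes(batch_size=64, sequence_length=56,
                         filter_sizes=[3, 4, 5], num_filters=128)
print(shapes[3]["conv"])   # [64, 54, 1, 128]
print(shapes[5]["conv"])   # [64, 52, 1, 128]
print(shapes["concat"])    # [64, 1, 1, 384]
print(shapes["flat"])      # [64, 384]
```

Only the last axis differs in meaning between the pooled tensors, which is why the concatenation in the model code uses axis 3.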

4. Dropout layer

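This one is straightforward: to prevent overfitting, we set a probability with which each neuron stays active. At every step the dropout layer disables a random subset of neurons according to that probability, and the subset differs each time, so the effect can be seen as a form of bagging.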

# dropout
with tf.name_scope('dropout'):
    self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)
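
What the dropout layer does to a feature vector can be sketched in plain Python. Like tf.nn.dropout, the sketch rescales the surviving values by 1/keep_prob so that the expected sum stays the same; the function below is an illustrative stand-in, not TensorFlow's implementation.

```python
import random

def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob and rescale it by 1/keep_prob."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)
features = [1.0] * 10000

# Training: roughly half of the values are zeroed, the rest are doubled
dropped = dropout(features, 0.5, rng)
kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(round(kept_fraction, 1), round(mean_after, 1))

# Evaluation: keep_prob = 1.0 leaves the features untouched
unchanged = dropout(features, 1.0, rng)
```

Because of the rescaling, nothing else in the network has to change between training and evaluation; only the value fed for dropout_keep_prob differs.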

5. Scores and Predictions

From the features obtained above we compute a score for each class. Softmax can turn these scores into a probability distribution, and the class with the highest probability is taken as the final prediction.

# score and prediction
l2_loss = tf.constant(0.0)  # accumulator for the L2 penalty; must exist before "+="
with tf.name_scope("output"):
    W = tf.get_variable('W', shape=[num_filters_total, num_classes],
                        initializer = tf.contrib.layers.xavier_initializer())
    b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name='b')
    l2_loss += tf.nn.l2_loss(W)
    l2_loss += tf.nn.l2_loss(b)
    self.score = tf.nn.xw_plus_b(self.h_drop, W, b, name='scores')
    self.prediction = tf.argmax(self.score, 1, name='prediction')
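
The relationship between scores, softmax probabilities, and the predicted class can be illustrated with a small stand-alone example. The score values are made up; the point is that the argmax of the raw scores and the argmax of the probabilities agree, which is why the model code can take argmax directly on the scores.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.3, 1.7]  # made-up scores for [negative, positive]
probs = softmax(scores)
prediction = max(range(len(scores)), key=lambda i: scores[i])

print([round(p, 3) for p in probs])
print(prediction)  # 1
```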

6. Loss and Accuracy

From the scores we can compute the model's loss, and the goal of training is to minimize it. For classification problems, the most commonly used loss function is the cross-entropy loss.

# mean cross-entropy loss
with tf.name_scope('loss'):
    losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.score, labels=self.input_y)
    self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss
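
As a worked example of this loss, the snippet below computes the softmax cross-entropy for two made-up samples and adds an L2 penalty. The logits, weights, and the value of l2_reg_lambda are assumptions for illustration; l2_loss follows the same convention as tf.nn.l2_loss, that is, half the sum of squares.

```python
import math

def softmax_cross_entropy(logits, one_hot):
    """Cross-entropy between softmax(logits) and a one-hot label."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return -sum(y * (z - log_sum_exp) for z, y in zip(logits, one_hot))

def l2_loss(values):
    """Half the sum of squares, the convention used by tf.nn.l2_loss."""
    return sum(v * v for v in values) / 2

batch_logits = [[2.0, 0.0], [0.0, 0.0]]  # made-up scores
batch_labels = [[1, 0], [0, 1]]          # one-hot labels

losses = [softmax_cross_entropy(z, y) for z, y in zip(batch_logits, batch_labels)]
mean_loss = sum(losses) / len(losses)

weights = [3.0, 4.0]   # stand-in for the output-layer parameters
l2_reg_lambda = 0.01   # assumed regularization strength
total_loss = mean_loss + l2_reg_lambda * l2_loss(weights)

print([round(v, 4) for v in losses])  # [0.1269, 0.6931]
print(round(total_loss, 4))           # 0.535
```

The second sample has equal scores for both classes, so its loss is log 2, the value expected from an uninformed two-class prediction.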

To monitor training as it runs, we can also define an accuracy metric.

# accuracy
with tf.name_scope('accuracy'):
    correct_predictions = tf.equal(self.prediction, tf.argmax(self.input_y, 1))
    self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, 'float'), name='accuracy')
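
The same accuracy computation can be written out by hand for a small batch. The scores and labels below are made up, and argmax and accuracy are illustrative helpers that mirror the tf.equal / tf.reduce_mean logic.

```python
def argmax(row):
    """Index of the largest entry in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(scores, one_hot_labels):
    """Fraction of rows whose highest score matches the one-hot label."""
    correct = [argmax(s) == argmax(y) for s, y in zip(scores, one_hot_labels)]
    return sum(correct) / len(correct)

scores = [[0.2, 1.1], [2.0, -1.0], [0.4, 0.3], [-0.5, 0.9]]
labels = [[0, 1], [1, 0], [0, 1], [0, 1]]

print(accuracy(scores, labels))  # 0.75
```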

At this point the model is fully defined, and we can use TensorBoard to see what it looks like.

Model training

Next we train the network on the movie review data.

Creating the graph and session

TensorFlow has two important concepts: Graph and Session.

A Session can be understood as an execution environment; operations only return results when they are run inside a session.
A Graph can be understood as the diagram shown above; it contains all the operations and tensors that will be used.

PS: A project can use multiple graphs, although one is usually enough. Also, one graph can be used by multiple sessions, but a session cannot contain multiple graphs.

with tf.Graph().as_default():
    session_conf = tf.ConfigProto(
        # allows TensorFlow to fall back on a device with a certain operation implemented
        allow_soft_placement= FLAGS.allow_soft_placement,
        # allows TensorFlow log on which devices (CPU or GPU) it places operations
        log_device_placement=FLAGS.log_device_placement
    )
    sess = tf.Session(config=session_conf)

Initialize CNN

cnn = TextCNN(sequence_length=x_train.shape[1],
              num_classes=y_train.shape[1],
              vocab_size= len(vocab_processor.vocabulary_),
              embedding_size=FLAGS.embedding_dim,
              filter_sizes= list(map(int, FLAGS.filter_sizes.split(','))),
              num_filters= FLAGS.num_filters,
              l2_reg_lambda= FLAGS.l2_reg_lambda)
global_step = tf.Variable(0, name='global_step', trainable=False)
optimizer = tf.train.AdamOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(cnn.loss)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)

train_op is what updates the parameters; each time train_op runs, global_step increases by 1.
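
To show what "update the parameters and advance the step counter" means, here is a sketch of the textbook Adam update for a single scalar parameter. It is not TensorFlow's implementation (which differs in some details); the class name ScalarAdam and the toy objective w ** 2 are assumptions for illustration.

```python
import math

class ScalarAdam:
    """Textbook Adam update for one scalar parameter."""

    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0          # running mean of gradients
        self.v = 0.0          # running mean of squared gradients
        self.global_step = 0  # incremented once per update

    def apply_gradient(self, param, grad):
        self.global_step += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        # Bias correction for the zero-initialized running means
        m_hat = self.m / (1 - self.beta1 ** self.global_step)
        v_hat = self.v / (1 - self.beta2 ** self.global_step)
        return param - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)

opt = ScalarAdam()
w = 5.0
for _ in range(3):
    grad = 2 * w  # gradient of the toy objective w ** 2
    w = opt.apply_gradient(w, grad)

print(opt.global_step)  # 3
print(w < 5.0)          # True
```

Each update moves the parameter by roughly the learning rate, largely independent of the gradient's magnitude, which is one reason a fixed rate such as 1e-3 works across differently scaled layers.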

Summaries

TensorFlow has a particularly handy feature, summaries, which record how parameters or other variables change during training and visualize them in TensorBoard. tf.summary.FileWriter() writes the summaries to local disk.

# visualise gradient
grad_summaries = []
for g, v in grads_and_vars:
    if g is not None:
        grad_hist_summary = tf.summary.histogram('{}/grad/hist'.format(v.name),g)
        sparsity_summary = tf.summary.scalar('{}/grad/sparsity'.format(v.name), tf.nn.zero_fraction(g))
        grad_summaries.append(grad_hist_summary)
        grad_summaries.append(sparsity_summary)
grad_summaries_merged = tf.summary.merge(grad_summaries)

# output dir for models and summaries
timestamp = str(time.time())
out_dir = os.path.abspath(os.path.join(os.path.curdir, 'runs', timestamp))
print('Writing to {} \n'.format(out_dir))

# summaries for loss and accuracy
loss_summary = tf.summary.scalar('loss', cnn.loss)
accuracy_summary = tf.summary.scalar('accuracy', cnn.accuracy)

# train summaries
train_summary_op = tf.summary.merge([loss_summary, accuracy_summary])
train_summary_dir = os.path.join(out_dir, 'summaries', 'train')
train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)

# dev summaries
dev_summary_op = tf.summary.merge([loss_summary, accuracy_summary])
dev_summary_dir = os.path.join(out_dir, 'summaries', 'dev')
dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph)

Checkpointing

Checkpointing saves the model's parameters at different stages of training, so that we can later pick the best set of parameters according to accuracy.

checkpoint_dir = os.path.abspath(os.path.join(out_dir, 'checkpoints'))
checkpoint_prefix = os.path.join(checkpoint_dir, 'model')
if not os.path.exists(checkpoint_dir):
    os.makedirs(checkpoint_dir)
saver = tf.train.Saver(tf.global_variables(), max_to_keep=FLAGS.num_checkpoints)
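
The max_to_keep argument makes the saver retain only the most recent checkpoints. The sketch below imitates that rotation with a queue of path names; it writes no files, and RollingCheckpoints is a hypothetical class used only to illustrate the behaviour. The path format prefix-step matches how the saver names checkpoints when global_step is passed to save.

```python
from collections import deque

class RollingCheckpoints:
    """Track checkpoint paths, discarding the oldest beyond max_to_keep."""

    def __init__(self, max_to_keep):
        self.max_to_keep = max_to_keep
        self.paths = deque()

    def save(self, prefix, step):
        path = "{}-{}".format(prefix, step)
        self.paths.append(path)
        removed = []
        while len(self.paths) > self.max_to_keep:
            removed.append(self.paths.popleft())
        return path, removed

keeper = RollingCheckpoints(max_to_keep=2)
for step in (100, 200, 300):
    path, removed = keeper.save("checkpoints/model", step)

print(list(keeper.paths))  # ['checkpoints/model-200', 'checkpoints/model-300']
print(removed)             # ['checkpoints/model-100']
```

Since rotation is by age rather than by score, choosing the best checkpoint by accuracy means comparing the dev-set results logged at the corresponding steps before older files are rotated out.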

Initializing the variables

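Before training starts, all variables normally need to be initialized. Running tf.global_variables_initializer() is usually enough.

Defining a single training step

We can define a function for a single training step, which uses one batch of data to update the model's parameters.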

def train_step(x_batch, y_batch):
    """
    A single training step
    :param x_batch:
    :param y_batch:
    :return:
    """
    feed_dict = {
        cnn.input_x: x_batch,
        cnn.input_y: y_batch,
        cnn.dropout_keep_prob: FLAGS.dropout_keep_prob
    }
    _, step, summaries, loss, accuracy = sess.run(
        [train_op, global_step, train_summary_op, cnn.loss, cnn.accuracy],
        feed_dict=feed_dict
    )
    time_str = datetime.datetime.now().isoformat()
    print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
    train_summary_writer.add_summary(summaries, step)

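The feed_dict here is what we mentioned earlier as the companion to placeholders. It must supply a value for every placeholder that the operations being run depend on; otherwise the program raises an error.

sess.run() then executes the operations defined earlier and returns information such as the loss and accuracy of each step.

Similarly, we define a function that checks the model's accuracy and loss on the validation set.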

def dev_step(x_batch, y_batch, writer=None):
    """
    Evaluate model on a dev set
    Disable dropout
    :param x_batch:
    :param y_batch:
    :param writer:
    :return:
    """
    feed_dict = {
        cnn.input_x: x_batch,
        cnn.input_y: y_batch,
        cnn.dropout_keep_prob: 1.0
    }
    step, summaries, loss, accuracy = sess.run(
        [global_step, dev_summary_op, cnn.loss, cnn.accuracy],
        feed_dict=feed_dict
    )
    time_str = datetime.datetime.now().isoformat()
    print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
    if writer:
        writer.add_summary(summaries, step)

Training loop

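With everything defined, we can start training. Each iteration calls train_step on one batch of training data, and at regular intervals we evaluate on the dev set and save a checkpoint: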

# generate batches
batches = data_process.batch_iter(list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs)
# training loop
for batch in batches:
    x_batch, y_batch = zip(*batch)
    train_step(x_batch, y_batch)
    current_step = tf.train.global_step(sess, global_step)
    if current_step % FLAGS.evaluate_every == 0:
        print('\n Evaluation:')
        dev_step(x_dev, y_dev, writer=dev_summary_writer)
        print('')
    if current_step % FLAGS.checkpoint_every == 0:
        path = saver.save(sess, checkpoint_prefix, global_step=current_step)
        print('Save model checkpoint to {} \n'.format(path))
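
The loop relies on data_process.batch_iter, whose body is not shown above. A minimal sketch of what such a generator might look like, assuming it reshuffles the data each epoch and yields consecutive slices; the shuffle and seed parameters are additions made here for illustration and may not match the original helper.

```python
import random

def batch_iter(data, batch_size, num_epochs, shuffle=True, seed=None):
    """Yield mini-batches, going over the full data set num_epochs times."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(num_epochs):
        order = list(range(len(data)))
        if shuffle:
            rng.shuffle(order)  # new sample order every epoch
        for start in range(0, len(order), batch_size):
            yield [data[i] for i in order[start:start + batch_size]]

# Ten toy (x, y) pairs standing in for zip(x_train, y_train)
pairs = list(zip(range(10), "abcdefghij"))
batches = list(batch_iter(pairs, batch_size=4, num_epochs=2, seed=0))

print(len(batches))               # 6
print([len(b) for b in batches])  # [4, 4, 2, 4, 4, 2]

# Unzipping a batch, as the training loop does
x_batch, y_batch = zip(*batches[0])
```

The last batch of each epoch is smaller when the data size is not a multiple of batch_size, which is fine here because the placeholders leave the batch dimension as None.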

The final output looks roughly like this:

Visualizing Results

To launch TensorBoard and view it in the browser, open a terminal in the code directory and run:

tensorboard --logdir runs/xxxxxx/summaries

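Summary

Of course, this is only a baseline for using a CNN on NLP classification tasks (text classification, sentiment analysis, and so on). The accuracy is clearly not very high, and there is plenty of room for improvement, including using pre-trained word2vec vectors and adding L2 regularization.

The complete code can be obtained by replying "CNN2014" in the official account's backend.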


References

[1] Convolutional Neural Networks for Sentence Classification: https://arxiv.org/abs/1408.5882

[2] Movie Review Data from Rotten Tomatoes: http://www.cs.cornell.edu/people/pabo/movie-review-data/
