Training a Recurrent Neural Network Language Model with TensorFlow
After spending the better part of an afternoon reading the TensorFlow Recurrent Neural Network tutorial and going through its PTB implementation, I found the code rather hard to follow. So, borrowing parts of it, I wrote a simplified language model of my own, with an approach inspired by Keras's LSTM text generation example.
Code: GitHub
Please credit Gaussic when reposting.
The Language Model
The main idea of a language model is to predict the most likely next word given the words that came before it. For example, given The fat cat sat on the, we consider mat a more likely continuation than hat, because a cat is more likely to sit on a mat than on a hat.
This may sound like common sense, but in natural language processing the task can be described with a probabilistic model. Take The fat cat sat on the mat. We can estimate the probability of the first word, $p(\text{The})$, the conditional probability that fat follows The, $p(\text{fat}\mid\text{The})$, and from them the joint probability of The fat occurring together:

- $p(\text{The},\text{fat}) = p(\text{The}) \cdot p(\text{fat}\mid\text{The})$

Extending this to the first three words gives

- $p(\text{The},\text{fat},\text{cat}) = p(\text{The}) \cdot p(\text{fat}\mid\text{The}) \cdot p(\text{cat}\mid\text{The},\text{fat})$

which, rearranged, expresses the conditional probability of cat given The fat:

- $p(\text{cat}\mid\text{The},\text{fat}) = \dfrac{p(\text{The},\text{fat},\text{cat})}{p(\text{The},\text{fat})}$

By the chain rule, the probability of a whole sentence $S = w_1 w_2 \cdots w_n$ is

- $p(S) = p(w_1,w_2,\cdots,w_n) = p(w_1)\cdot p(w_2\mid w_1)\cdot p(w_3\mid w_1,w_2)\cdots p(w_n\mid w_1,w_2,w_3,\cdots,w_{n-1})$

Conditioning every word on all of its predecessors quickly becomes intractable, so in practice a Markov assumption is made: each word depends only on the few words immediately before it. A bigram model keeps only the previous word:

- $p(S) = p(w_1,w_2,\cdots,w_n) = p(w_1)\cdot p(w_2\mid w_1)\cdot p(w_3\mid w_2)\cdot p(w_4\mid w_3)\cdots p(w_n\mid w_{n-1})$

and a trigram model keeps the two previous words:

- $p(S) = p(w_1,w_2,\cdots,w_n) = p(w_1)\cdot p(w_2\mid w_1)\cdot p(w_3\mid w_1,w_2)\cdot p(w_4\mid w_2,w_3)\cdots p(w_n\mid w_{n-2},w_{n-1})$
Such truncated conditional probabilities are much easier to estimate, but they discard a lot of the earlier words, which can sometimes hurt the results. The question, then, is how to choose an n that keeps the computation simple while still preserving most of the context.
Everything above describes the traditional language model. Without digging too deep into the details, our task is simply: given the previous n words, compute the probability of the next word, and use the resulting language model to generate new text.
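To make the count-based view concrete, here is a minimal sketch of a bigram model estimated from raw counts; the toy corpus and the p_next helper are made up for illustration and are not part of the model built later:

```python
from collections import Counter

# toy corpus, made up for illustration
corpus = "the fat cat sat on the mat the fat cat sat on the hat".split()

unigrams = Counter(corpus)                   # counts of single words
bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs

def p_next(word, prev):
    """Estimate p(word | prev) as count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_next('fat', 'the'))   # 2/4 = 0.5: 'the' is followed by 'fat' in 2 of its 4 occurrences
print(p_next('mat', 'the'))   # 1/4 = 0.25
```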
In this post, however, we focus on how to use an RNN to predict the next word.
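First download the PTB dataset used by the TensorFlow tutorial and extract it: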
```
$ wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
$ tar xvf simple-examples.tgz
```
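We only need the files under simple-examples/data/. The training file ptb.train.txt contains one tokenized sentence per line, with rare words already replaced by <unk>; a few sample lines: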
 
```
we 're talking about years ago before anyone heard of asbestos having any questionable properties
there is no asbestos in our products now
neither <unk> nor the researchers who studied the workers were aware of any research on smokers of the kent cigarettes
we have no useful information on whether users are at risk said james a. <unk> of boston 's <unk> cancer institute
the total of N deaths from malignant <unk> lung cancer and <unk> was far higher than expected the researchers said
```
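The data-reading helpers and the model below assume the following imports (the model uses TensorFlow 1.x APIs such as tf.contrib.rnn):

```python
import os
from collections import Counter

import numpy as np
import tensorflow as tf
```

_read_words reads a whole file, marks each line break with an <eos> (end-of-sentence) token, and splits the text into a list of words: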
 
```python
def _read_words(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read().replace('\n', '<eos>').split()
```

```python
f = _read_words('simple-examples/data/ptb.train.txt')
print(f[:20])
```

```
['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim']
```
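_build_vocab counts how often each word occurs and assigns ids in order of decreasing frequency, so the most frequent word gets id 0: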
 
```python
def _build_vocab(filename):
    data = _read_words(filename)

    counter = Counter(data)
    count_pairs = sorted(counter.items(), key=lambda x: -x[1])

    words, _ = list(zip(*count_pairs))
    word_to_id = dict(zip(words, range(len(words))))

    return words, word_to_id
```

```python
words, words_to_id = _build_vocab('simple-examples/data/ptb.train.txt')
print(words[:10])
print(list(map(lambda x: words_to_id[x], words[:10])))
```

```
('the', '<unk>', '<eos>', 'N', 'of', 'to', 'a', 'in', 'and', "'s")
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```
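With the vocabulary in place, _file_to_word_ids turns a file into a list of word ids, dropping any word that is not in the vocabulary: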
 
```python
def _file_to_word_ids(filename, word_to_id):
    data = _read_words(filename)
    return [word_to_id[x] for x in data if x in word_to_id]
```

```python
words_in_file = _file_to_word_ids('simple-examples/data/ptb.train.txt', words_to_id)
print(words_in_file[:20])
```

```
[9980, 9988, 9981, 9989, 9970, 9998, 9971, 9979, 9992, 9997, 9982, 9972, 9993, 9991, 9978, 9983, 9974, 9986, 9999, 9990]
```
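to_words maps a sequence of ids back to words, and ptb_raw_data ties everything together: it builds the vocabulary from the training set and converts the train, validation and test files into id sequences: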
 
```python
def to_words(sentence, words):
    return list(map(lambda x: words[x], sentence))
```

```python
def ptb_raw_data(data_path=None):
    train_path = os.path.join(data_path, 'ptb.train.txt')
    valid_path = os.path.join(data_path, 'ptb.valid.txt')
    test_path = os.path.join(data_path, 'ptb.test.txt')

    words, word_to_id = _build_vocab(train_path)

    train_data = _file_to_word_ids(train_path, word_to_id)
    valid_data = _file_to_word_ids(valid_path, word_to_id)
    test_data = _file_to_word_ids(test_path, word_to_id)

    return train_data, valid_data, test_data, words, word_to_id
```
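ptb_producer then slices the long id sequence into fixed-length training examples and groups them into batches; its parameters are described right after the code: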
 
```python
def ptb_producer(raw_data, batch_size=64, num_steps=20, stride=1):
    data_len = len(raw_data)

    sentences = []
    next_words = []
    for i in range(0, data_len - num_steps, stride):
        sentences.append(raw_data[i:(i + num_steps)])
        next_words.append(raw_data[i + num_steps])

    sentences = np.array(sentences)
    next_words = np.array(next_words)

    batch_len = len(sentences) // batch_size
    x = np.reshape(sentences[:(batch_len * batch_size)],
                   [batch_len, batch_size, -1])
    y = np.reshape(next_words[:(batch_len * batch_size)],
                   [batch_len, batch_size])

    return x, y
```
 
- raw_data: the data produced by the ptb_raw_data() function
- batch_size: the network is trained with stochastic gradient descent, so the data is fed in batches; this is the number of examples per batch
- num_steps: the length of each sentence, corresponding to the n described earlier; in a recurrent network this is also called the number of time steps
- stride: the stride used when slicing windows, which determines how many examples are produced
 
How the code works:
This function turns the raw data list into batches shaped [batch_len, batch_size, num_steps].
First, every window of num_steps words becomes one sentence x, and the single word that follows the window becomes its prediction target y. This reorganizes the raw data into batch_len * batch_size pairs of x and y, much like a classification problem where x is given and y must be predicted.
To suit stochastic gradient descent, the examples are then grouped into small batches; each step we feed one batch to TensorFlow to update the weights, so the data ends up in the [batch_len, batch_size, num_steps] layout.
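To make the slicing concrete, here is a tiny example on made-up ids (the numbers are hypothetical, not PTB data):

```python
toy = list(range(10))   # [0, 1, ..., 9]
x, y = ptb_producer(toy, batch_size=2, num_steps=3)

print(x.shape)   # (3, 2, 3): 3 batches, 2 sentences per batch, 3 words per sentence
print(y.shape)   # (3, 2)
print(x[0])      # [[0 1 2], [1 2 3]] -- their next words are y[0] = [3 4]
```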
Now print a bit of the real data:
```python
train_data, valid_data, test_data, words, word_to_id = ptb_raw_data('simple-examples/data')
x_train, y_train = ptb_producer(train_data)
print(x_train.shape)
print(y_train.shape)
```

```
(14524, 64, 20)
(14524, 64)
```

```python
print(' '.join(to_words(x_train[100, 3], words)))
```

```
despite steady sales growth <eos> magna recently cut its quarterly dividend in half and the company 's class a shares
```

```python
print(words[np.argmax(y_train[100, 3])])
```

```
the
```
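Now we define the model. All hyper-parameters live in one configuration class (rnn_model selects between the LSTM and GRU cells defined later; vocab_size is filled in at run time from the data):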
 
```python
class LMConfig(object):
    """Configuration for the language model."""
    batch_size = 64        # number of examples per batch
    num_steps = 20         # length of each sentence
    stride = 3             # stride used when slicing the data

    embedding_dim = 64     # word embedding dimension
    hidden_dim = 128       # dimension of the RNN hidden layer
    num_layers = 2         # number of RNN layers
    rnn_model = 'gru'      # which cell to use: 'lstm' or 'gru'

    learning_rate = 0.05   # learning rate
    dropout = 0.2          # dropout probability after each layer
```
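PTBInput wraps the batched data produced by ptb_producer and serves one batch at a time; next_batch also converts the target word ids into one-hot vectors of vocabulary size: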
 
```python
class PTBInput(object):
    """Serves the data batch by batch."""
    def __init__(self, config, data):
        self.batch_size = config.batch_size
        self.num_steps = config.num_steps
        self.vocab_size = config.vocab_size   # vocabulary size

        self.input_data, self.targets = ptb_producer(data,
            self.batch_size, self.num_steps)

        self.batch_len = self.input_data.shape[0]   # total number of batches
        self.cur_batch = 0                          # index of the current batch

    def next_batch(self):
        """Return the next batch."""
        x = self.input_data[self.cur_batch]
        y = self.targets[self.cur_batch]

        # convert the targets to one-hot vectors
        y_ = np.zeros((y.shape[0], self.vocab_size), dtype=bool)
        for i in range(y.shape[0]):
            y_[i][y[i]] = 1

        # wrap around to the first batch after the last one
        self.cur_batch = (self.cur_batch + 1) % self.batch_len

        return x, y_
```
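PTBModel builds the graph itself: word ids are mapped to embeddings, fed through a multi-layer RNN (LSTM or GRU cells, each wrapped with dropout), and the output of the last time step goes through a fully connected layer and a softmax over the vocabulary to predict the next word: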
 
```python
class PTBModel(object):
    def __init__(self, config, is_training=True):
        self.num_steps = config.num_steps
        self.vocab_size = config.vocab_size

        self.embedding_dim = config.embedding_dim
        self.hidden_dim = config.hidden_dim
        self.num_layers = config.num_layers
        self.rnn_model = config.rnn_model

        self.learning_rate = config.learning_rate
        self.dropout = config.dropout

        self.placeholders()  # input placeholders
        self.rnn()           # build the RNN
        self.cost()          # cost function
        self.optimize()      # optimizer
        self.error()         # error rate

    def placeholders(self):
        """Placeholders for the input data."""
        self._inputs = tf.placeholder(tf.int32, [None, self.num_steps])
        self._targets = tf.placeholder(tf.int32, [None, self.vocab_size])

    def input_embedding(self):
        """Map the input word ids to word vectors."""
        with tf.device("/cpu:0"):
            embedding = tf.get_variable(
                "embedding", [self.vocab_size,
                    self.embedding_dim], dtype=tf.float32)
            _inputs = tf.nn.embedding_lookup(embedding, self._inputs)

        return _inputs

    def rnn(self):
        """Build the RNN."""
        def lstm_cell():   # basic LSTM cell
            return tf.contrib.rnn.BasicLSTMCell(self.hidden_dim,
                state_is_tuple=True)

        def gru_cell():    # GRU cell, faster to train
            return tf.contrib.rnn.GRUCell(self.hidden_dim)

        def dropout_cell():   # add dropout after each cell
            if self.rnn_model == 'lstm':
                cell = lstm_cell()
            else:
                cell = gru_cell()
            return tf.contrib.rnn.DropoutWrapper(cell,
                output_keep_prob=1.0 - self.dropout)   # keep probability = 1 - dropout

        cells = [dropout_cell() for _ in range(self.num_layers)]
        cell = tf.contrib.rnn.MultiRNNCell(cells, state_is_tuple=True)   # multi-layer RNN

        _inputs = self.input_embedding()
        _outputs, _ = tf.nn.dynamic_rnn(cell=cell,
            inputs=_inputs, dtype=tf.float32)

        # _outputs has shape [batch_size, num_steps, hidden_dim]
        last = _outputs[:, -1, :]   # we only need the output of the last time step

        # dense + softmax classify the next word over the vocabulary
        logits = tf.layers.dense(inputs=last, units=self.vocab_size)
        prediction = tf.nn.softmax(logits)

        self._logits = logits
        self._pred = prediction

    def cost(self):
        """Cross-entropy cost."""
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
            logits=self._logits, labels=self._targets)
        cost = tf.reduce_mean(cross_entropy)
        self.cost = cost

    def optimize(self):
        """Adam optimizer."""
        optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate)
        self.optim = optimizer.minimize(self.cost)

    def error(self):
        """Error rate (fraction of wrong next-word predictions)."""
        mistakes = tf.not_equal(
            tf.argmax(self._targets, 1), tf.argmax(self._pred, 1))
        self.errors = tf.reduce_mean(tf.cast(mistakes, tf.float32))
```
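Finally, run_epoch loads the training data, builds the model and trains it with the Adam optimizer, printing the loss and a few sample predictions every 500 batches. Calling run_epoch() starts training: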
 
```python
def run_epoch(num_epochs=10):
    config = LMConfig()   # load the configuration

    # load the raw data; only the training set is used here
    train_data, _, _, words, word_to_id = \
        ptb_raw_data('simple-examples/data')
    config.vocab_size = len(words)

    # split the data into batches
    input_train = PTBInput(config, train_data)
    batch_len = input_train.batch_len

    # build the model
    model = PTBModel(config)

    # create the session and initialize the variables
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    print('Start training...')
    for epoch in range(num_epochs):   # training epochs
        for i in range(batch_len):    # batches per epoch
            x_batch, y_batch = input_train.next_batch()

            # run one optimization step on this batch
            feed_dict = {model._inputs: x_batch, model._targets: y_batch}
            sess.run(model.optim, feed_dict=feed_dict)

            # print intermediate results every 500 batches
            if i % 500 == 0:
                cost = sess.run(model.cost, feed_dict=feed_dict)

                msg = "Epoch: {0:>3}, batch: {1:>6}, Loss: {2:>6.3}"
                print(msg.format(epoch + 1, i + 1, cost))

                # show some of the predictions
                pred = sess.run(model._pred, feed_dict=feed_dict)
                word_ids = np.argmax(pred, 1)
                print('Predicted:', ' '.join(words[w] for w in word_ids))
                true_ids = np.argmax(y_batch, 1)
                print('True:', ' '.join(words[w] for w in true_ids))

    print('Finish training...')
    sess.close()
```
 
It takes quite a few epochs of training before the results become reasonably sensible.
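The stated goal also includes generating new text with the trained model, in the spirit of the Keras LSTM text generation example mentioned at the top, but no generation code is shown above. Here is a minimal sketch of how one could do greedy generation with this model; the generate_text helper and its arguments are my own additions, and it assumes access to the sess, model and words objects from inside run_epoch (note that dropout stays active at inference time, since the model has no is_training switch):

```python
def generate_text(sess, model, words, seed_ids, length=50):
    """Greedily extend a seed of word ids by repeatedly predicting the next word."""
    sentence = list(seed_ids)   # the seed must contain at least num_steps ids
    for _ in range(length):
        # feed the last num_steps ids as a batch of size 1
        x = np.array([sentence[-model.num_steps:]])
        pred = sess.run(model._pred, feed_dict={model._inputs: x})
        next_id = int(np.argmax(pred[0]))   # pick the most probable next word
        sentence.append(next_id)
    return ' '.join(words[w] for w in sentence)
```

For example, inside run_epoch one could call print(generate_text(sess, model, words, x_batch[0])) after training to extend the first sentence of the last batch; sampling from pred[0] instead of taking the argmax usually gives more varied text.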