Training a Recurrent Neural Network Language Model with TensorFlow
After spending the better part of an afternoon reading the TensorFlow Recurrent Neural Network tutorial and going through its PTB implementation, I found the code rather hard to follow. So, borrowing parts of it, I wrote a simplified language model of my own, with an approach inspired by Keras's LSTM text generation example.
Code: GitHub
Please credit Gaussic when reposting.
The Language Model
The main idea of a language model is to predict the most likely next word given the words that came before it. For example, given The fat cat sat on the, we consider mat a more likely continuation than hat, because a cat is more likely to sit on a mat than on a hat.
This may sound like common sense, but in natural language processing the task can be described with a probabilistic model. Take The fat cat sat on the mat. We can estimate the probability of the first word, $p(\text{The})$, the conditional probability that fat follows The, $p(\text{fat}\mid\text{The})$, and from them the joint probability of The fat occurring together:

- $p(\text{The},\text{fat}) = p(\text{The}) \cdot p(\text{fat}\mid\text{The})$

Extending this to the first three words gives

- $p(\text{The},\text{fat},\text{cat}) = p(\text{The}) \cdot p(\text{fat}\mid\text{The}) \cdot p(\text{cat}\mid\text{The},\text{fat})$

which, rearranged, expresses the conditional probability of cat given The fat:

- $p(\text{cat}\mid\text{The},\text{fat}) = \dfrac{p(\text{The},\text{fat},\text{cat})}{p(\text{The},\text{fat})}$

By the chain rule, the probability of a whole sentence $S = w_1 w_2 \cdots w_n$ is

- $p(S) = p(w_1,w_2,\cdots,w_n) = p(w_1)\cdot p(w_2\mid w_1)\cdot p(w_3\mid w_1,w_2)\cdots p(w_n\mid w_1,w_2,w_3,\cdots,w_{n-1})$

Conditioning every word on all of its predecessors quickly becomes intractable, so in practice a Markov assumption is made: each word depends only on the few words immediately before it. A bigram model keeps only the previous word:

- $p(S) = p(w_1,w_2,\cdots,w_n) = p(w_1)\cdot p(w_2\mid w_1)\cdot p(w_3\mid w_2)\cdot p(w_4\mid w_3)\cdots p(w_n\mid w_{n-1})$

and a trigram model keeps the two previous words:

- $p(S) = p(w_1,w_2,\cdots,w_n) = p(w_1)\cdot p(w_2\mid w_1)\cdot p(w_3\mid w_1,w_2)\cdot p(w_4\mid w_2,w_3)\cdots p(w_n\mid w_{n-2},w_{n-1})$
Such truncated conditional probabilities are much easier to estimate, but they discard a lot of the earlier words, which can sometimes hurt the results. The question, then, is how to choose an n that keeps the computation simple while still preserving most of the context.
Everything above describes the traditional language model. Without digging too deep into the details, our task is simply: given the previous n words, compute the probability of the next word, and use the resulting language model to generate new text.
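To make the count-based view concrete, here is a minimal sketch of a bigram model estimated from raw counts; the toy corpus and the p_next helper are made up for illustration and are not part of the model built later:

```python
from collections import Counter

# toy corpus, made up for illustration
corpus = "the fat cat sat on the mat the fat cat sat on the hat".split()

unigrams = Counter(corpus)                   # counts of single words
bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs

def p_next(word, prev):
    """Estimate p(word | prev) as count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_next('fat', 'the'))   # 2/4 = 0.5: 'the' is followed by 'fat' in 2 of its 4 occurrences
print(p_next('mat', 'the'))   # 1/4 = 0.25
```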
In this post, however, we focus on how to use an RNN to predict the next word.
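First download the PTB dataset used by the TensorFlow tutorial and extract it: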
```
$ wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
$ tar xvf simple-examples.tgz
```
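We only need the files under simple-examples/data/. The training file ptb.train.txt contains one tokenized sentence per line, with rare words already replaced by <unk>; a few sample lines: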
 
```
we 're talking about years ago before anyone heard of asbestos having any questionable properties
there is no asbestos in our products now
neither <unk> nor the researchers who studied the workers were aware of any research on smokers of the kent cigarettes
we have no useful information on whether users are at risk said james a. <unk> of boston 's <unk> cancer institute
the total of N deaths from malignant <unk> lung cancer and <unk> was far higher than expected the researchers said
```
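The data-reading helpers and the model below assume the following imports (the model uses TensorFlow 1.x APIs such as tf.contrib.rnn):

```python
import os
from collections import Counter

import numpy as np
import tensorflow as tf
```

_read_words reads a whole file, marks each line break with an <eos> (end-of-sentence) token, and splits the text into a list of words: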
 
```python
def _read_words(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read().replace('\n', '<eos>').split()
```

```python
f = _read_words('simple-examples/data/ptb.train.txt')
print(f[:20])
```

```
['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim']
```
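_build_vocab counts how often each word occurs and assigns ids in order of decreasing frequency, so the most frequent word gets id 0: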
 
```python
def _build_vocab(filename):
    data = _read_words(filename)

    counter = Counter(data)
    count_pairs = sorted(counter.items(), key=lambda x: -x[1])

    words, _ = list(zip(*count_pairs))
    word_to_id = dict(zip(words, range(len(words))))

    return words, word_to_id
```

```python
words, words_to_id = _build_vocab('simple-examples/data/ptb.train.txt')
print(words[:10])
print(list(map(lambda x: words_to_id[x], words[:10])))
```

```
('the', '<unk>', '<eos>', 'N', 'of', 'to', 'a', 'in', 'and', "'s")
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```
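With the vocabulary in place, _file_to_word_ids turns a file into a list of word ids, dropping any word that is not in the vocabulary: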
 
```python
def _file_to_word_ids(filename, word_to_id):
    data = _read_words(filename)
    return [word_to_id[x] for x in data if x in word_to_id]
```

```python
words_in_file = _file_to_word_ids('simple-examples/data/ptb.train.txt', words_to_id)
print(words_in_file[:20])
```

```
[9980, 9988, 9981, 9989, 9970, 9998, 9971, 9979, 9992, 9997, 9982, 9972, 9993, 9991, 9978, 9983, 9974, 9986, 9999, 9990]
```
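to_words maps a sequence of ids back to words, and ptb_raw_data ties everything together: it builds the vocabulary from the training set and converts the train, validation and test files into id sequences: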
 
```python
def to_words(sentence, words):
    return list(map(lambda x: words[x], sentence))
```

```python
def ptb_raw_data(data_path=None):
    train_path = os.path.join(data_path, 'ptb.train.txt')
    valid_path = os.path.join(data_path, 'ptb.valid.txt')
    test_path = os.path.join(data_path, 'ptb.test.txt')

    words, word_to_id = _build_vocab(train_path)

    train_data = _file_to_word_ids(train_path, word_to_id)
    valid_data = _file_to_word_ids(valid_path, word_to_id)
    test_data = _file_to_word_ids(test_path, word_to_id)

    return train_data, valid_data, test_data, words, word_to_id
```
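ptb_producer then slices the long id sequence into fixed-length training examples and groups them into batches; its parameters are described right after the code: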
 
```python
def ptb_producer(raw_data, batch_size=64, num_steps=20, stride=1):
    data_len = len(raw_data)

    sentences = []
    next_words = []
    for i in range(0, data_len - num_steps, stride):
        sentences.append(raw_data[i:(i + num_steps)])
        next_words.append(raw_data[i + num_steps])

    sentences = np.array(sentences)
    next_words = np.array(next_words)

    batch_len = len(sentences) // batch_size
    x = np.reshape(sentences[:(batch_len * batch_size)],
                   [batch_len, batch_size, -1])
    y = np.reshape(next_words[:(batch_len * batch_size)],
                   [batch_len, batch_size])

    return x, y
```
 
- raw_data: the data produced by the ptb_raw_data() function
- batch_size: the network is trained with stochastic gradient descent, so the data is fed in batches; this is the number of examples per batch
- num_steps: the length of each sentence, corresponding to the n described earlier; in a recurrent network this is also called the number of time steps
- stride: the stride used when slicing windows, which determines how many examples are produced
 
How the code works:
This function turns the raw data list into batches shaped [batch_len, batch_size, num_steps].
First, every window of num_steps words becomes one sentence x, and the single word that follows the window becomes its prediction target y. This reorganizes the raw data into batch_len * batch_size pairs of x and y, much like a classification problem where x is given and y must be predicted.
To suit stochastic gradient descent, the examples are then grouped into small batches; each step we feed one batch to TensorFlow to update the weights, so the data ends up in the [batch_len, batch_size, num_steps] layout.
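To make the slicing concrete, here is a tiny example on made-up ids (the numbers are hypothetical, not PTB data):

```python
toy = list(range(10))   # [0, 1, ..., 9]
x, y = ptb_producer(toy, batch_size=2, num_steps=3)

print(x.shape)   # (3, 2, 3): 3 batches, 2 sentences per batch, 3 words per sentence
print(y.shape)   # (3, 2)
print(x[0])      # [[0 1 2], [1 2 3]] -- their next words are y[0] = [3 4]
```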
Now print a bit of the real data:
```python
train_data, valid_data, test_data, words, word_to_id = ptb_raw_data('simple-examples/data')
x_train, y_train = ptb_producer(train_data)
print(x_train.shape)
print(y_train.shape)
```

```
(14524, 64, 20)
(14524, 64)
```

```python
print(' '.join(to_words(x_train[100, 3], words)))
```

```
despite steady sales growth <eos> magna recently cut its quarterly dividend in half and the company 's class a shares
```

```python
print(words[np.argmax(y_train[100, 3])])
```

```
the
```
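Now we define the model. All hyper-parameters live in one configuration class (rnn_model selects between the LSTM and GRU cells defined later; vocab_size is filled in at run time from the data):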
 
```python
class LMConfig(object):
    """Configuration for the language model."""
    batch_size = 64        # number of examples per batch
    num_steps = 20         # length of each sentence
    stride = 3             # stride used when slicing the data

    embedding_dim = 64     # word embedding dimension
    hidden_dim = 128       # dimension of the RNN hidden layer
    num_layers = 2         # number of RNN layers
    rnn_model = 'gru'      # which cell to use: 'lstm' or 'gru'

    learning_rate = 0.05   # learning rate
    dropout = 0.2          # dropout probability after each layer
```
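PTBInput wraps the batched data produced by ptb_producer and serves one batch at a time; next_batch also converts the target word ids into one-hot vectors of vocabulary size: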
 
```python
class PTBInput(object):
    """Serves the data batch by batch."""
    def __init__(self, config, data):
        self.batch_size = config.batch_size
        self.num_steps = config.num_steps
        self.vocab_size = config.vocab_size   # vocabulary size

        self.input_data, self.targets = ptb_producer(data,
            self.batch_size, self.num_steps)

        self.batch_len = self.input_data.shape[0]   # total number of batches
        self.cur_batch = 0                          # index of the current batch

    def next_batch(self):
        """Return the next batch."""
        x = self.input_data[self.cur_batch]
        y = self.targets[self.cur_batch]

        # convert the targets to one-hot vectors
        y_ = np.zeros((y.shape[0], self.vocab_size), dtype=bool)
        for i in range(y.shape[0]):
            y_[i][y[i]] = 1

        # wrap around to the first batch after the last one
        self.cur_batch = (self.cur_batch + 1) % self.batch_len

        return x, y_
```
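PTBModel builds the graph itself: word ids are mapped to embeddings, fed through a multi-layer RNN (LSTM or GRU cells, each wrapped with dropout), and the output of the last time step goes through a fully connected layer and a softmax over the vocabulary to predict the next word: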
 
```python
class PTBModel(object):
    def __init__(self, config, is_training=True):
        self.num_steps = config.num_steps
        self.vocab_size = config.vocab_size

        self.embedding_dim = config.embedding_dim
        self.hidden_dim = config.hidden_dim
        self.num_layers = config.num_layers
        self.rnn_model = config.rnn_model

        self.learning_rate = config.learning_rate
        self.dropout = config.dropout

        self.placeholders()  # input placeholders
        self.rnn()           # build the RNN
        self.cost()          # cost function
        self.optimize()      # optimizer
        self.error()         # error rate

    def placeholders(self):
        """Placeholders for the input data."""
        self._inputs = tf.placeholder(tf.int32, [None, self.num_steps])
        self._targets = tf.placeholder(tf.int32, [None, self.vocab_size])

    def input_embedding(self):
        """Map the input word ids to word vectors."""
        with tf.device("/cpu:0"):
            embedding = tf.get_variable(
                "embedding", [self.vocab_size,
                    self.embedding_dim], dtype=tf.float32)
            _inputs = tf.nn.embedding_lookup(embedding, self._inputs)

        return _inputs

    def rnn(self):
        """Build the RNN."""
        def lstm_cell():   # basic LSTM cell
            return tf.contrib.rnn.BasicLSTMCell(self.hidden_dim,
                state_is_tuple=True)

        def gru_cell():    # GRU cell, faster to train
            return tf.contrib.rnn.GRUCell(self.hidden_dim)

        def dropout_cell():   # add dropout after each cell
            if self.rnn_model == 'lstm':
                cell = lstm_cell()
            else:
                cell = gru_cell()
            return tf.contrib.rnn.DropoutWrapper(cell,
                output_keep_prob=1.0 - self.dropout)   # keep probability = 1 - dropout

        cells = [dropout_cell() for _ in range(self.num_layers)]
        cell = tf.contrib.rnn.MultiRNNCell(cells, state_is_tuple=True)   # multi-layer RNN

        _inputs = self.input_embedding()
        _outputs, _ = tf.nn.dynamic_rnn(cell=cell,
            inputs=_inputs, dtype=tf.float32)

        # _outputs has shape [batch_size, num_steps, hidden_dim]
        last = _outputs[:, -1, :]   # we only need the output of the last time step

        # dense + softmax classify the next word over the vocabulary
        logits = tf.layers.dense(inputs=last, units=self.vocab_size)
        prediction = tf.nn.softmax(logits)

        self._logits = logits
        self._pred = prediction

    def cost(self):
        """Cross-entropy cost."""
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
            logits=self._logits, labels=self._targets)
        cost = tf.reduce_mean(cross_entropy)
        self.cost = cost

    def optimize(self):
        """Adam optimizer."""
        optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate)
        self.optim = optimizer.minimize(self.cost)

    def error(self):
        """Error rate (fraction of wrong next-word predictions)."""
        mistakes = tf.not_equal(
            tf.argmax(self._targets, 1), tf.argmax(self._pred, 1))
        self.errors = tf.reduce_mean(tf.cast(mistakes, tf.float32))
```
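Finally, run_epoch loads the training data, builds the model and trains it with the Adam optimizer, printing the loss and a few sample predictions every 500 batches. Calling run_epoch() starts training: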
 
```python
def run_epoch(num_epochs=10):
    config = LMConfig()   # load the configuration

    # load the raw data; only the training set is used here
    train_data, _, _, words, word_to_id = \
        ptb_raw_data('simple-examples/data')
    config.vocab_size = len(words)

    # split the data into batches
    input_train = PTBInput(config, train_data)
    batch_len = input_train.batch_len

    # build the model
    model = PTBModel(config)

    # create the session and initialize the variables
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    print('Start training...')
    for epoch in range(num_epochs):   # training epochs
        for i in range(batch_len):    # batches per epoch
            x_batch, y_batch = input_train.next_batch()

            # run one optimization step on this batch
            feed_dict = {model._inputs: x_batch, model._targets: y_batch}
            sess.run(model.optim, feed_dict=feed_dict)

            # print intermediate results every 500 batches
            if i % 500 == 0:
                cost = sess.run(model.cost, feed_dict=feed_dict)

                msg = "Epoch: {0:>3}, batch: {1:>6}, Loss: {2:>6.3}"
                print(msg.format(epoch + 1, i + 1, cost))

                # show some of the predictions
                pred = sess.run(model._pred, feed_dict=feed_dict)
                word_ids = np.argmax(pred, 1)
                print('Predicted:', ' '.join(words[w] for w in word_ids))
                true_ids = np.argmax(y_batch, 1)
                print('True:', ' '.join(words[w] for w in true_ids))

    print('Finish training...')
    sess.close()
```
 
It takes quite a few epochs of training before the results become reasonably sensible.
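The stated goal also includes generating new text with the trained model, in the spirit of the Keras LSTM text generation example mentioned at the top, but no generation code is shown above. Here is a minimal sketch of how one could do greedy generation with this model; the generate_text helper and its arguments are my own additions, and it assumes access to the sess, model and words objects from inside run_epoch (note that dropout stays active at inference time, since the model has no is_training switch):

```python
def generate_text(sess, model, words, seed_ids, length=50):
    """Greedily extend a seed of word ids by repeatedly predicting the next word."""
    sentence = list(seed_ids)   # the seed must contain at least num_steps ids
    for _ in range(length):
        # feed the last num_steps ids as a batch of size 1
        x = np.array([sentence[-model.num_steps:]])
        pred = sess.run(model._pred, feed_dict={model._inputs: x})
        next_id = int(np.argmax(pred[0]))   # pick the most probable next word
        sentence.append(next_id)
    return ' '.join(words[w] for w in sentence)
```

For example, inside run_epoch one could call print(generate_text(sess, model, words, x_batch[0])) after training to extend the first sentence of the last batch; sampling from pred[0] instead of taking the argmax usually gives more varied text.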