A Clear Guide to Eight Common Word Embedding Techniques in NLP

Author: Programmer Xiaohan (程序员小寒), 2024-07-15 08:13:12

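Hi everyone, I'm Xiaohan.

Today I'd like to walk through the word embedding techniques commonly used in natural language processing.

Word embedding is an NLP technique that maps words to a continuous vector space so that text can be processed and analyzed more effectively.

These vectors (embeddings) can capture semantic relationships between words as well as contextual information.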

Common word embedding techniques

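1. One-Hot Encoding

One-hot encoding is the simplest way to turn words into vectors. Each word is represented by a vector whose length equals the vocabulary size, with a 1 at the word's own position and 0 everywhere else.

Advantages

Simple and easy to implement.
Requires no assumptions and no learning process.

Disadvantages

Very high dimensionality: the vector length grows with the vocabulary.
Cannot capture semantic relationships between words.
Sparse representation, which is inefficient in memory and computation.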

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample data
corpus = ['dog', 'cat', 'dog', 'fish']

# Reshape data to fit the model
corpus = np.array(corpus).reshape(-1, 1)

# One-hot encode the data
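# sparse_output=False returns a dense array (the parameter was named `sparse` before scikit-learn 1.2)
onehot_encoder = OneHotEncoder(sparse_output=False)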
onehot_encoded = onehot_encoder.fit_transform(corpus)

print(onehot_encoded)

#output
[[0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
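
To make the second drawback concrete, here is a minimal sketch that builds one-hot vectors by hand with NumPy and checks the cosine similarity between them. The helper names `one_hot` and `cosine` are made up for this illustration. Every pair of distinct words comes out with similarity 0, so 'dog' is no closer to 'cat' than it is to 'fish'.

```python
import numpy as np

vocab = sorted({'dog', 'cat', 'fish'})          # ['cat', 'dog', 'fish']
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of vocabulary size with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(one_hot('dog'), one_hot('cat')))   # 0.0
print(cosine(one_hot('dog'), one_hot('fish')))  # 0.0
print(cosine(one_hot('dog'), one_hot('dog')))   # 1.0
```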

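2. Bag of Words (BoW)

Bag of Words (BoW) is a simple NLP technique for representing text documents as numeric vectors.

The idea is to treat each document as a bag, or collection, of words and count how often each word occurs in it.

It ignores word order but provides a direct way to convert text into vectors.

Advantages

Simple and easy to implement.
Effective on small text collections.

Disadvantages

High-dimensional vectors when the vocabulary is large.
Cannot capture word order or semantic relationships.
Treats frequent and rare words alike, so it cannot tell which words matter.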

from sklearn.feature_extraction.text import CountVectorizer

# Sample data
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

print(X.toarray())
print(vectorizer.get_feature_names_out())

#output of the above code
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
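
The loss of word order can be seen with a few lines of plain Python. This is only a sketch of the counting idea, not how CountVectorizer is implemented; the function `bag_of_words` and the three-word vocabulary are hypothetical. Two sentences with opposite meanings end up with the same vector.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ['bites', 'dog', 'man']
a = bag_of_words('dog bites man', vocabulary)
b = bag_of_words('man bites dog', vocabulary)

print(a)       # [1, 1, 1]
print(a == b)  # True: word order is lost
```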

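3. TF-IDF

TF-IDF improves on BoW by accounting for word importance: it lowers the weight of words that are common across documents and raises the weight of rare ones.

The idea behind TF-IDF is to compute the importance of a word in a document from two factors:

Term frequency (TF): measures how often a word appears in a document. The more often it appears, the more important it is to that document.
Inverse document frequency (IDF): measures how informative a word is across all documents in the corpus. It rests on the intuition that a word appearing in many documents carries less information than one appearing in few.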

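Formulas:

TF: term frequency, how often term t occurs in document d, relative to the document's length. $\mathrm{TF}(t, d) = \dfrac{\text{number of times } t \text{ appears in } d}{\text{total number of words in } d}$
IDF: inverse document frequency, how rare the term is across the whole corpus. $\mathrm{IDF}(t) = \log \dfrac{\text{total number of documents}}{\text{number of documents containing } t}$
TF-IDF: the product of TF and IDF. $\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$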
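
The formulas can be tried out directly in plain Python. The sketch below assumes the natural logarithm and assumes every queried term occurs in at least one document; the function names are chosen for illustration. Library implementations often use smoothed or normalized variants, so their numbers can differ from these.

```python
import math

docs = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'document', 'is', 'the', 'second', 'document'],
    ['and', 'this', 'is', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document'],
]

def tf(term, doc):
    """Occurrences of the term divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (number of documents / number of documents containing the term)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf('this', docs[0], docs))    # 0.0: 'this' occurs in every document
print(tf_idf('second', docs[1], docs))  # (1/6) * ln(4/1)
```

A word that appears in every document gets an IDF of log(1) = 0, so its TF-IDF weight vanishes no matter how often it occurs.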

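Advantages

Emphasizes important words and reduces the influence of common ones.
Well suited to information retrieval and text mining.

Disadvantages

Vectors are still sparse and high-dimensional.
Cannot capture word order or semantic relationships.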

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

print(X.toarray())
print(vectorizer.get_feature_names_out())

#output
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
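
These values do not come from the textbook TF × IDF product. With its default settings, TfidfVectorizer uses raw counts for TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then L2-normalizes each row. The NumPy sketch below recomputes the matrix under those assumptions, starting from the raw counts printed in the BoW example.

```python
import numpy as np

terms = ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
# Raw term counts, copied from the CountVectorizer output in the BoW section
counts = np.array([
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                           # documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1               # smoothed IDF
tfidf = counts * idf                                    # raw count times IDF
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)   # L2-normalize each row

# Print the weights of the first document, term by term
for term, weight in zip(terms, tfidf[0]):
    print(f"{term}: {weight:.8f}")
```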

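4. Word2Vec

Word2Vec is a neural-network-based model that produces dense vector representations of words.

The basic idea is to train a shallow neural network on a prediction task between a word and its surrounding context, then use the learned weights as word vectors that capture word meaning.

It learns semantic relationships between words with one of two architectures: Continuous Bag of Words (CBOW) and Skip-gram.

Continuous Bag of Words (CBOW): predicts the target word from its surrounding context words.
Skip-gram: predicts the surrounding context words from the target word.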
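
The difference between the two architectures is easiest to see in the training examples each one draws from a sentence. The sketch below is a simplification with hypothetical helper names: it uses a fixed window and only builds the pairs, leaving out the network itself and the sampling tricks real implementations rely on.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs: the target word predicts each neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """Return (context_words, target) pairs: the neighbors jointly predict the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

tokens = ['this', 'is', 'the', 'first', 'document']
print(skipgram_pairs(tokens, window=1)[:3])
# [('this', 'is'), ('is', 'this'), ('is', 'the')]
print(cbow_pairs(tokens, window=1)[1])
# (['this', 'the'], 'is')
```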

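Advantages

Captures semantic relationships between words.
Produces dense, relatively low-dimensional word vectors.
Works especially well when trained on large corpora.

Disadvantages

Needs a large corpus for training.
Demands considerable computing resources.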

from gensim.models import Word2Vec

# Sample data
sentences = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'document', 'is', 'the', 'second', 'document'],
    ['and', 'this', 'is', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document']
]

# Build the vocabulary and train the model. Passing `sentences` to the
# constructor trains immediately, so no separate train() call is needed.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4, epochs=10)

# Get vector for a word
print(model.wv['document'])

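5. GloVe

GloVe (Global Vectors for Word Representation) is a word embedding technique proposed by researchers at Stanford University in 2014.

It combines count-based global matrix factorization with prediction-based local context window methods, learning word vectors from the global co-occurrence statistics of word pairs in a large corpus.

GloVe first builds a word-word co-occurrence matrix, in which each entry records how many times two words appear together within a given window, and then fits word vectors to it in a factorization-like fashion. The model looks for vectors such that the dot product of two word vectors (plus bias terms) closely approximates the logarithm of the pair's co-occurrence count.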
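
As a rough illustration of these two steps, the sketch below counts co-occurrences on a toy corpus and evaluates a GloVe-style weighted least-squares loss. The helper names are hypothetical and the random vectors are placeholders, not trained embeddings. The weighting function with x_max = 100 and alpha = 0.75 follows the values reported in the GloVe paper. Actual training would minimize this loss by gradient descent, which is left out here.

```python
import numpy as np

sentences = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'document', 'is', 'the', 'second', 'document'],
]
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

def cooccurrence(sentences, window=2):
    """Symmetric count matrix: X[i, j] = times word j appears within `window` of word i."""
    X = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    X[idx[w], idx[s[j]]] += 1
    return X

def glove_loss(X, W, W_ctx, b, b_ctx, x_max=100.0, alpha=0.75):
    """Weighted squared error between dot products (plus biases) and log counts."""
    i, j = np.nonzero(X)                                  # only observed pairs contribute
    weight = np.minimum(1.0, (X[i, j] / x_max) ** alpha)  # down-weights rare pairs, caps frequent ones
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    return float(np.sum(weight * (pred - np.log(X[i, j])) ** 2))

X = cooccurrence(sentences)
rng = np.random.default_rng(0)
dim = 8
W = rng.normal(size=(len(vocab), dim))       # placeholder word vectors
W_ctx = rng.normal(size=(len(vocab), dim))   # placeholder context vectors
b, b_ctx = np.zeros(len(vocab)), np.zeros(len(vocab))

print(X[idx['is'], idx['the']])   # 2.0
print(glove_loss(X, W, W_ctx, b, b_ctx))
```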

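Advantages

Captures semantic relationships between words along with global corpus statistics.
Produces dense, relatively low-dimensional word vectors.
Performs well on large corpora.

Disadvantages

Needs a large corpus for training.
Demands considerable computing resources.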

import gensim.downloader as api

# Download pre-trained GloVe model (choose the size you need - 50, 100, 200, or 300 dimensions)
glove_vectors = api.load("glove-wiki-gigaword-100")  # Example: 100-dimensional GloVe

# Get word vectors (embeddings)
word1 = "king"
word2 = "queen"
vector1 = glove_vectors[word1]
vector2 = glove_vectors[word2]

# Compute cosine similarity between the two word vectors
similarity = glove_vectors.similarity(word1, word2)

print(f"Word vectors for '{word1}': {vector1}")
print(f"Word vectors for '{word2}': {vector2}")
print(f"Cosine similarity between '{word1}' and '{word2}': {similarity}")

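6. FastText

FastText is a word embedding technique developed by Facebook's AI Research team.

It extends Word2Vec. Its main feature is representing each word through its subwords, which lets it handle out-of-vocabulary (OOV) and misspelled words better.

The core idea is to break each word into a set of subwords, namely character n-grams, and learn a vector for each of them. A word is then represented by combining the vectors of its subwords, which captures information about its internal structure.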
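
The decomposition into character n-grams can be sketched in a few lines. The function `char_ngrams` is a hypothetical helper, not part of any library. It wraps the word in the boundary markers `<` and `>` and uses n-gram lengths from 3 to 6, which are FastText's default settings. FastText additionally learns a vector for the whole word, which this sketch leaves out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams('where', min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Related word forms share many n-grams, which is how a vector can be
# composed even for a word that never appeared in the training data.
shared = set(char_ngrams('document')) & set(char_ngrams('documents'))
print(len(shared) > 0)  # True
```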

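Advantages

Handles out-of-vocabulary words: because it uses subword information, FastText can produce vectors for new words outside the vocabulary.
Better generalization: capturing the internal structure of words improves how well the embeddings generalize.
Efficient: trains quickly on large datasets while producing high-quality word vectors.

Disadvantages

Larger model and higher memory use than Word2Vec, because vectors are stored for character n-grams as well as for whole words.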

from gensim.models import FastText

# Sample data
sentences = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'document', 'is', 'the', 'second', 'document'],
    ['and', 'this', 'is', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document']
]

# Build the vocabulary and train the model. Passing `sentences` to the
# constructor trains immediately, so no separate train() call is needed.
model = FastText(sentences, vector_size=100, window=5, min_count=1, workers=4, epochs=10)

# Get vector for a word
print(model.wv['document'])

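7. ELMo

ELMo (Embeddings from Language Models) is a contextual word embedding technique developed by researchers at the Allen Institute for AI, home of the AllenNLP library.

Unlike traditional word embeddings, ELMo's word vectors depend on context: the same word receives different vectors in different sentences, and even at different positions within the same sentence.

ELMo uses a bidirectional LSTM language model to learn contextual representations of words from text. The language model is pretrained, and its representations are then adapted to a specific task, yielding dynamic, context-dependent word embeddings.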
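
One detail of that adaptation step is how the layers of the language model are merged into a single vector per token. In the ELMo paper this is a softmax-weighted sum of the layer activations, scaled by a task-specific factor. The NumPy sketch below illustrates only that mixing step. The random array stands in for real model activations, and the names `mix_layers`, `scalar_logits`, and `gamma` are chosen for this example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mix_layers(layers, scalar_logits, gamma=1.0):
    """Collapse (num_layers, seq_len, dim) activations into (seq_len, dim)."""
    weights = softmax(scalar_logits)                       # one weight per layer, sums to 1
    return gamma * np.tensordot(weights, layers, axes=1)   # weighted sum over the layer axis

rng = np.random.default_rng(0)
layers = rng.normal(size=(3, 5, 1024))   # stand-in for 3 layers, 5 tokens, 1024 dimensions
mixed = mix_layers(layers, scalar_logits=np.zeros(3))
print(mixed.shape)  # (5, 1024)
```

With all logits equal, the mix reduces to a plain average of the layers. In a downstream model the logits and `gamma` would be learned together with the task.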

Advantages

Context-dependent: captures the different meanings a word takes in different contexts.
Adaptable: performs well on many NLP tasks, including named entity recognition (NER) and question answering.

import tensorflow as tf
import tensorflow_hub as hub

# Load pre-trained ELMo model from TensorFlow Hub
elmo = hub.load("https://tfhub.dev/google/elmo/3")

# Sample data
sentences = ["This is the first document.", "This document is the second document."]

def elmo_vectors(sentences):
    embeddings = elmo.signatures['default'](tf.constant(sentences))['elmo']
    return embeddings

# Get ELMo embeddings
elmo_embeddings = elmo_vectors(sentences)
print(elmo_embeddings)

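8. BERT

BERT is a Transformer-based model that produces context-aware embeddings by considering the whole sentence bidirectionally, that is, using both the left and right context of each word.

Unlike traditional word embeddings such as Word2Vec or GloVe, which produce a single representation per word, BERT produces a different embedding for each word depending on its context.

Advantages

Bidirectional context encoding: captures the context both before and after each word at the same time.
Pretraining and fine-tuning: pretraining a large language model and then fine-tuning it on a specific task markedly improves performance.
Broad applicability: performs well on many NLP tasks, such as question answering, text classification, and named entity recognition.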

from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample data
sentence = "This is the first document."

# Tokenize input
inputs = tokenizer(sentence, return_tensors='pt')

# Get embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state

print(embeddings)
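
Here `last_hidden_state` holds one vector per token, with shape (batch size, sequence length, hidden size), where the hidden size is 768 for bert-base-uncased. When a single vector per sentence is needed, one common option, which the example above does not do, is to average the token vectors while skipping padding positions. The NumPy sketch below shows that pooling step on a random stand-in array so that it runs without downloading a model. The function name `mean_pool` is hypothetical.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                # avoid division by zero
    return summed / counts

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2, 4, 768))    # stand-in for last_hidden_state
attention_mask = np.array([[1, 1, 1, 1],
                           [1, 1, 0, 0]])       # second sentence has two padding tokens

sentence_vectors = mean_pool(hidden_states, attention_mask)
print(sentence_vectors.shape)  # (2, 768)
```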

Editor: Wu Xiaoyan. Source: 程序员学长
