Renting pitfalls? A hands-on guide to skimming reviews with R and choosing a room scientifically

Author: Big Data Digest 2018-09-02 15:15:30

Big Data, Data Analysis


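Produced by Big Data Digest

Compiled by: Hope, Zhenzhen, CoolBoy

Lately, renting has become a major headache for people who have moved to Beijing for work. To land a place you are happy with, you have to be quick, and you also have to read what earlier tenants said so you can avoid the pitfalls. One programmer applied programming tools to a similar problem when choosing a hotel for a trip: the author scraped several thousand reviews from the overseas review site TripAdvisor with Python, then ran text analysis and sentiment analysis on them in R. The result is an efficient, data-driven way to choose a room, and a useful reference for anyone facing the same decision.

The detailed tutorial follows.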

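The information on TripAdvisor matters a great deal to travelers' decisions. However, understanding the nuances between TripAdvisor's bubble ratings and the text of thousands of reviews is very challenging.

To understand more fully whether hotel guests' reviews influence the hotel's later service, I scraped all English reviews of one hotel on TripAdvisor, the Hilton Hawaiian Village. I will not go into the details of the scraper here.

Python source code: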

https://github.com/susanli2016/NLP-with-Python/blob/master/Web%20scraping%20Hilton%20Hawaiian%20Village%20TripAdvisor%20Reviews.py

Load the packages

library(dplyr)
library(readr)
library(lubridate)
library(ggplot2)
library(tidytext)
library(tidyverse)
library(stringr)
library(tidyr)
library(scales)
library(broom)
library(purrr)
library(widyr)
library(igraph)
library(ggraph)
library(SnowballC)
library(wordcloud)
library(reshape2)
theme_set(theme_minimal())

The dataset

df <- read_csv("Hilton_Hawaiian_Village_Waikiki_Beach_Resort-Honolulu_Oahu_Hawaii__en.csv") 
df <- df[complete.cases(df), ] 
df$review_date <- as.Date(df$review_date, format = "%d-%B-%y") 
 
dim(df); min(df$review_date); max(df$review_date)

Figure 1

We obtained 13,701 English reviews of the Hilton Hawaiian Village from TripAdvisor, with review dates ranging from 2002-03-21 to 2018-08-02.

df %>% 
  count(Week = round_date(review_date, "week")) %>% 
  ggplot(aes(Week, n)) + 
  geom_line() +  
  ggtitle('The Number of Reviews Per Week')

Figure 2

Weekly review counts peaked in late 2014, when the hotel received 70 reviews in a single week.
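The weekly aggregation can also be sketched in base R on a few made-up dates. This is only an illustration: the dates are hypothetical, and `cut()` assigns each date to the week that contains it, whereas the article's `round_date()` rounds to the nearest week boundary, so the two can bin a date differently.

```r
# Toy review dates (hypothetical, not taken from the dataset).
review_dates <- as.Date(c("2018-07-30", "2018-07-31", "2018-08-02", "2018-08-06"))

# cut() assigns each date to the week that contains it; weeks start on Monday
# by default, and each level is labelled with that Monday's date.
week_start <- cut(review_dates, breaks = "week")

# Tabulate the number of reviews per week.
weekly_counts <- table(week_start)
print(weekly_counts)
```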

Text mining the review text

df <- tibble::rowid_to_column(df, "ID") 
df <- df %>% 
  mutate(review_date = as.POSIXct(review_date, origin = "1970-01-01"),month = round_date(review_date, "month")) 
 
review_words <- df %>% 
  distinct(review_body, .keep_all = TRUE) %>% 
  unnest_tokens(word, review_body, drop = FALSE) %>% 
  distinct(ID, word, .keep_all = TRUE) %>% 
  anti_join(stop_words, by = "word") %>% 
  filter(str_detect(word, "[^\\d]")) %>% 
  group_by(word) %>% 
  mutate(word_total = n()) %>% 
  ungroup() 
 
word_counts <- review_words %>% 
  count(word, sort = TRUE) 
 
word_counts %>% 
  head(25) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n)) + 
  geom_col(fill = "lightblue") + 
  scale_y_continuous(labels = comma_format()) + 
  coord_flip() + 
  labs(title = "Most common words in review text 2002 to date", 
       subtitle = "Among 13,701 reviews; stop words removed", 
       y = "# of uses")

Figure 3

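We can go a step further and merge closely related words such as "stay" and "stayed", or "pool" and "pools". This step is called stemming: the process of reducing inflected (or derived) words to their stem, base, or root form.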

word_counts %>% 
  head(25) %>% 
  mutate(word = wordStem(word)) %>%  
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n)) + 
  geom_col(fill = "lightblue") + 
  scale_y_continuous(labels = comma_format()) + 
  coord_flip() + 
  labs(title = "Most common words in review text 2002 to date", 
       subtitle = "Among 13,701 reviews; stop words removed and stemmed", 
       y = "# of uses")

Figure 4
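If the goal is a single count per stem, one option is to stem first and then add up the counts that share a stem. The sketch below assumes the SnowballC package loaded earlier and uses made-up counts; the variable names are illustrative only.

```r
library(SnowballC)  # provides wordStem(), as used in the article

# Hypothetical word counts for illustration only.
word <- c("stay", "stayed", "pool", "pools")
n    <- c(10, 7, 5, 3)

# Reduce each word to its stem, then add up the counts that share a stem.
stem <- wordStem(word)
stem_counts <- tapply(n, stem, sum)
print(sort(stem_counts, decreasing = TRUE))
```

With these toy numbers the four words collapse into two stems, one carrying the combined count for "stay"/"stayed" and one for "pool"/"pools".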

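Bigrams

We often want to understand how the words in a review relate to one another. Which word sequences are common in the review text? Given a sequence of words, which word is most likely to follow? Which words are most strongly associated? Many interesting text-mining analyses build on these relationships. When we examine two consecutive words, we call the pair a "bigram".

So which bigrams are most common in the Hilton Hawaiian Village reviews?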

review_bigrams <- df %>% 
  unnest_tokens(bigram, review_body, token = "ngrams", n = 2) 
 
bigrams_separated <- review_bigrams %>% 
  separate(bigram, c("word1", "word2"), sep = " ") 
 
bigrams_filtered <- bigrams_separated %>% 
  filter(!word1 %in% stop_words$word) %>% 
  filter(!word2 %in% stop_words$word) 
 
bigram_counts <- bigrams_filtered %>%  
  count(word1, word2, sort = TRUE) 
 
bigrams_united <- bigrams_filtered %>% 
  unite(bigram, word1, word2, sep = " ") 
 
bigrams_united %>% 
  count(bigram, sort = TRUE)

Figure 5

The most common bigram is "rainbow tower", followed by "hawaiian village".

We can visualize these word relationships as a network:
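To make the idea of a bigram concrete, here is a minimal base-R sketch that builds the bigrams of one made-up sentence by pairing each word with the word that follows it. It is not how `unnest_tokens()` is implemented; it only illustrates the output shape.

```r
# Split a (made-up) review into lowercase word tokens.
review <- "We stayed in the Rainbow Tower with an ocean view"
words <- strsplit(tolower(review), "[^a-z']+")[[1]]
words <- words[words != ""]

# Pair each word with the word that follows it.
bigrams <- paste(head(words, -1), tail(words, -1))
print(bigrams)
```

A sentence of ten words yields nine bigrams, which is why filtering out pairs that contain stop words matters before counting.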

review_subject <- df %>%  
  unnest_tokens(word, review_body) %>%  
  anti_join(stop_words, by = "word") 
 
# Treat the numbers 1 to 10 as additional stop words 
my_stopwords <- tibble(word = as.character(1:10)) 
review_subject <- review_subject %>%  
  anti_join(my_stopwords, by = "word") 
 
title_word_pairs <- review_subject %>%  
  pairwise_count(word, ID, sort = TRUE, upper = FALSE) 
 
set.seed(1234) 
title_word_pairs %>% 
  filter(n >= 1000) %>% 
  graph_from_data_frame() %>% 
  ggraph(layout = "fr") + 
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") + 
  geom_node_point(size = 5) + 
  geom_node_text(aes(label = name), repel = TRUE,  
                 point.padding = unit(0.2, "lines")) + 
  ggtitle('Word network in TripAdvisor reviews') + 
  theme_void()

Figure 6

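The figure above shows the more common word pairs in the TripAdvisor reviews: pairs of words that appear together in the same review at least 1,000 times, with stop words excluded.

In the network graph, the most frequent words ("hawaiian", "village", "ocean" and "view") are strongly connected, but we do not see any clear clustering.

Trigrams

Bigrams are sometimes not enough, so let's look at the most common trigrams in the TripAdvisor reviews of the Hilton Hawaiian Village.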
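The network is built from co-occurrence within a review rather than from adjacent words. A minimal base-R sketch of that kind of counting, on three made-up tokenized reviews, might look like this; it illustrates the idea and is not the widyr implementation.

```r
# Three made-up reviews, already tokenized; IDs are hypothetical.
reviews <- list(
  "1" = c("ocean", "view", "pool"),
  "2" = c("ocean", "view", "beach"),
  "3" = c("pool", "beach", "ocean")
)

# For each review, list every unordered pair of distinct words it contains.
pairs_per_review <- lapply(reviews, function(words) {
  w <- sort(unique(words))
  combn(w, 2, FUN = paste, collapse = " + ")
})

# Count in how many reviews each pair appears.
pair_counts <- sort(table(unlist(pairs_per_review)), decreasing = TRUE)
print(pair_counts)
```

Sorting the words inside each review keeps "ocean + view" and "view + ocean" from being counted as different pairs, which plays the same role as `upper = FALSE` in the article's call.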

review_trigrams <- df %>% 
  unnest_tokens(trigram, review_body, token = "ngrams", n = 3) 
 
trigrams_separated <- review_trigrams %>% 
  separate(trigram, c("word1", "word2", "word3"), sep = " ") 
 
trigrams_filtered <- trigrams_separated %>% 
  filter(!word1 %in% stop_words$word) %>% 
  filter(!word2 %in% stop_words$word) %>% 
  filter(!word3 %in% stop_words$word) 
 
trigram_counts <- trigrams_filtered %>%  
  count(word1, word2, word3, sort = TRUE) 
 
trigrams_united <- trigrams_filtered %>% 
  unite(trigram, word1, word2, word3, sep = " ") 
 
trigrams_united %>% 
  count(trigram, sort = TRUE)

Figure 7

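The most common trigram is "hilton hawaiian village", followed by "diamond head tower", and so on.

Trends of key words in the reviews

Which words or topics have become more common, or rarer, over time? This can hint at changes the hotel has made, for example in service, renovation, or problem resolution. It can also help predict which topics will be mentioned more often.

We want to answer questions like this one: which words have appeared more and more frequently in the TripAdvisor reviews over time?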

reviews_per_month <- df %>% 
  group_by(month) %>% 
  summarize(month_total = n()) 
 
word_month_counts <- review_words %>% 
  filter(word_total >= 1000) %>% 
  count(word, month) %>% 
  complete(word, month, fill = list(n = 0)) %>% 
  inner_join(reviews_per_month, by = "month") %>% 
  mutate(percent = n / month_total) %>% 
  mutate(year = year(month) + yday(month) / 365) 
 
mod <- ~ glm(cbind(n, month_total - n) ~ year, ., family = "binomial") 
 
slopes <- word_month_counts %>% 
  nest(data = -word) %>% 
  mutate(model = map(data, mod), 
         tidied = map(model, tidy)) %>% 
  unnest(tidied) %>% 
  filter(term == "year") %>% 
  arrange(desc(estimate)) 
 
slopes %>% 
  head(9) %>% 
  inner_join(word_month_counts, by = "word") %>% 
  mutate(word = reorder(word, -estimate)) %>% 
  ggplot(aes(month, n / month_total, color = word)) + 
  geom_line(show.legend = FALSE) + 
  scale_y_continuous(labels = percent_format()) + 
  facet_wrap(~ word, scales = "free_y") + 
  expand_limits(y = 0) + 
  labs(x = "Year", 
       y = "Percentage of reviews containing this word", 
       title = "9 fastest growing words in TripAdvisor reviews", 
       subtitle = "Judged by growth rate over 15 years")

Figure 8

Before 2010, discussion centered on "friday fireworks" and "lagoon", while words such as "resort fee" and "busy" grew fastest before 2005.

Which words are becoming less frequent in the reviews?
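The growth ranking comes from fitting one binomial regression per word and comparing the slopes on time. A minimal sketch of a single such fit, on made-up yearly numbers for one hypothetical word, could look like this:

```r
# Made-up yearly data: reviews written and reviews mentioning a given word.
trend <- data.frame(
  year        = 2010:2017,
  month_total = c(100, 110, 120, 130, 140, 150, 160, 170),
  n           = c(5, 8, 12, 17, 24, 31, 40, 51)
)

# Binomial regression: successes = reviews with the word,
# failures = reviews without it, predictor = time.
fit <- glm(cbind(n, month_total - n) ~ year, data = trend, family = "binomial")

# A positive slope means the log-odds of mentioning the word rise over time.
slope <- coef(fit)[["year"]]
print(slope)
```

Modeling counts out of a total, rather than raw counts, keeps a word from looking like it is "growing" merely because the number of reviews per period grew.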

slopes %>% 
  tail(9) %>% 
  inner_join(word_month_counts, by = "word") %>% 
  mutate(word = reorder(word, estimate)) %>% 
  ggplot(aes(month, n / month_total, color = word)) + 
  geom_line(show.legend = FALSE) + 
  scale_y_continuous(labels = percent_format()) + 
  facet_wrap(~ word, scales = "free_y") + 
  expand_limits(y = 0) + 
  labs(x = "Year", 
       y = "Percentage of reviews containing this term", 
       title = "9 fastest shrinking words in TripAdvisor reviews", 
       subtitle = "Judged by growth rate over 15 years")

Figure 9

This figure shows topics that have become less common since 2010. They include "hhv" (which I believe is short for hilton hawaiian village), "breakfast", "upgraded", "prices" and "free".

Let's compare a couple of words.

word_month_counts %>% 
  filter(word %in% c("service", "food")) %>% 
  ggplot(aes(month, n / month_total, color = word)) + 
  geom_line(size = 1, alpha = .8) + 
  scale_y_continuous(labels = percent_format()) + 
  expand_limits(y = 0) + 
  labs(x = "Year", 
       y = "Percentage of reviews containing this term", title = "service vs food in terms of reviewers interest")

Figure 10

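Before 2010, both service and food were hot topics. Discussion of service and food peaked in 2003 and has trended downward since 2005, with only occasional rebounds.

Sentiment analysis

Sentiment analysis is widely used to analyze reviews, surveys, and web and social media text in order to capture how customers feel, with applications ranging from marketing and customer service to clinical medicine.

In this case, our goal is to analyze the attitude of reviewers (that is, hotel guests) toward the hotel after their stay. That attitude may be a judgment or an evaluation.

Let's look at the most frequent positive and negative words in the reviews.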

reviews <- df %>%  
  filter(!is.na(review_body)) %>%  
  select(ID, review_body) %>%  
  group_by(row_number()) %>%  
  ungroup() 
tidy_reviews <- reviews %>% 
  unnest_tokens(word, review_body) 
tidy_reviews <- tidy_reviews %>% 
  anti_join(stop_words) 
 
bing_word_counts <- tidy_reviews %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(word, sentiment, sort = TRUE) %>% 
  ungroup() 
 
bing_word_counts %>% 
  group_by(sentiment) %>% 
  top_n(10) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n, fill = sentiment)) + 
  geom_col(show.legend = FALSE) + 
  facet_wrap(~sentiment, scales = "free") + 
  labs(y = "Contribution to sentiment", x = NULL) + 
  coord_flip() +  
  ggtitle('Words that contribute to positive and negative sentiment in the reviews')

Figure 11

Let's try a different sentiment lexicon and see whether the results are the same.

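# Note: depending on the tidytext version, the AFINN score column may be named 
# `value` rather than `score`; if so, use `value` here and in the code below. 
contributions <- tidy_reviews %>% 
  inner_join(get_sentiments("afinn"), by = "word") %>% 
  group_by(word) %>% 
  summarize(occurences = n(), 
            contribution = sum(score)) 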
contributions %>% 
  top_n(25, abs(contribution)) %>% 
  mutate(word = reorder(word, contribution)) %>% 
  ggplot(aes(word, contribution, fill = contribution > 0)) + 
  ggtitle('Words with the greatest contributions to positive/negative  
          sentiment in reviews') + 
  geom_col(show.legend = FALSE) + 
  coord_flip()

Figure 12

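Interestingly, "diamond" (from "Diamond Head") is classified as positive sentiment.

There is a potential problem here: whether a word such as "clean" is positive depends on context. If it is preceded by "not", the sentiment is negative. Unigrams run into this problem whenever a negation word such as "not" is present, which brings us to the next topic:

Using bigrams to provide context in sentiment analysis

We want to know which words are most often preceded by "not".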

bigrams_separated %>% 
  filter(word1 == "not") %>% 
  count(word1, word2, sort = TRUE)

Figure 13

"not a" appears 850 times and "not the" appears 698 times. This result is not especially informative, though.

AFINN <- get_sentiments("afinn") 
not_words <- bigrams_separated %>% 
  filter(word1 == "not") %>% 
  inner_join(AFINN, by = c(word2 = "word")) %>% 
  count(word2, score, sort = TRUE) %>% 
  ungroup() 
 
not_words

Figure 14

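The analysis above tells us that the most common sentiment word following "not" is "worth", followed by "recommend". Both are treated as positive words, each with a score of 2.

So which words in our data are most likely to be misread as the opposite sentiment?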

not_words %>% 
  mutate(contribution = n * score) %>% 
  arrange(desc(abs(contribution))) %>% 
  head(20) %>% 
  mutate(word2 = reorder(word2, contribution)) %>% 
  ggplot(aes(word2, n * score, fill = n * score > 0)) + 
  geom_col(show.legend = FALSE) + 
  xlab("Words preceded by \"not\"") + 
  ylab("Sentiment score * number of occurrences") + 
  ggtitle('The 20 words preceded by "not" that had the greatest contribution to  
          sentiment scores, positive or negative direction') + 
  coord_flip()

Figure 15

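The bigrams "not worth", "not great", "not good", "not recommend" and "not like" are the largest sources of misclassification, making the reviews look far more positive than they really are.

Besides "not", other negation words also reverse the sentiment of what follows, such as "no", "never" and "without". Let's take a look.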

negation_words <- c("not", "no", "never", "without") 
 
negated_words <- bigrams_separated %>% 
  filter(word1 %in% negation_words) %>% 
  inner_join(AFINN, by = c(word2 = "word")) %>% 
  count(word1, word2, score, sort = TRUE) %>% 
  ungroup() 
 
negated_words %>% 
  mutate(contribution = n * score, 
         word2 = reorder(paste(word2, word1, sep = "__"), contribution)) %>% 
  group_by(word1) %>% 
  top_n(12, abs(contribution)) %>% 
  ggplot(aes(word2, contribution, fill = n * score > 0)) + 
  geom_col(show.legend = FALSE) + 
  facet_wrap(~ word1, scales = "free") + 
  scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) + 
  xlab("Words preceded by negation term") + 
  ylab("Sentiment score * # of occurrences") + 
  ggtitle('The most common positive or negative words to follow negations  
          such as "no", "not", "never" and "without"') + 
  coord_flip()

Figure 16

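It appears that the biggest sources of words wrongly counted as positive are "not worth/great/good/recommend", while the biggest sources of words wrongly counted as negative are "not bad" and "no problem".

Finally, let's look at the most positive and most negative reviews.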
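One simple way to act on this finding is to flip the sign of a sentiment word when the word before it is a negation. The base-R sketch below assumes a tiny made-up lexicon (the scores are not the real AFINN values) and a hypothetical helper named `score_words`; it is a rough illustration, not a complete negation handler.

```r
# Toy lexicon with hypothetical scores (not the real AFINN values).
lexicon <- c(good = 3, worth = 2, bad = -3, problem = -2)
negations <- c("not", "no", "never", "without")

score_words <- function(text, flip_negated = TRUE) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[words != ""]
  scores <- unname(lexicon[words])   # NA for words not in the lexicon
  scores[is.na(scores)] <- 0
  if (flip_negated && length(words) > 1) {
    # Flip the sign of a sentiment word whose preceding word is a negation.
    negated <- c(FALSE, head(words, -1) %in% negations)
    scores[negated] <- -scores[negated]
  }
  sum(scores)
}

print(score_words("not worth the price", flip_negated = FALSE))
print(score_words("not worth the price"))
```

Without the flip, "not worth the price" scores as positive; with it, the same phrase scores as negative. The approach only looks one word back, so phrases like "not really worth it" would still be missed.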

sentiment_messages <- tidy_reviews %>% 
  inner_join(get_sentiments("afinn"), by = "word") %>% 
  group_by(ID) %>% 
  summarize(sentiment = mean(score), 
            words = n()) %>% 
  ungroup() %>% 
  filter(words >= 5) 
 
sentiment_messages %>% 
  arrange(desc(sentiment))

Figure 17

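The most positive review is record ID 2363, which reads, in translation: "Wow, wow, wow, this place is great! We had a beautiful view from our room and really enjoyed our stay. Hilton hotels are just wonderful! Whether you are a child or an adult, this hotel has everything you could want."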

df[ which(df$ID==2363), ]$review_body[1]

Figure 18

sentiment_messages %>% 
  arrange(sentiment)

Figure 19

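The most negative review is record ID 3748, which reads, in translation: "(I) stayed 5 nights (May 12 to May 17, 2016). On the first night, we found that the floor tiles were broken and the kids were playing with them with their fingers. On the second night, we saw small cockroaches crawling on the children's food. The front desk offered us another room, but they told us to move within one hour or we could not change rooms... It was already 11 pm, we were all tired, and the kids were asleep. We declined the offer. At checkout, the front desk lady told me that cockroaches are very common in their hotel. She even asked me whether I never see cockroaches in California. I did not expect to run into something like this at a Hilton."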

df[ which(df$ID==3748), ]$review_body[1]

Figure 20
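The per-review ranking used to find these two reviews can be sketched in base R on a toy token table. The IDs and scores below are hypothetical; the point is the mean score per review combined with the minimum of five scored words.

```r
# Toy token table: one row per (review ID, scored word); scores are hypothetical.
tokens <- data.frame(
  ID    = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  score = c(3, 2, 3, 1, 1, 4, 4, -2, -3, -1, -2, -2)
)

# Mean score and number of scored words per review.
sentiment <- tapply(tokens$score, tokens$ID, mean)
words     <- tapply(tokens$score, tokens$ID, length)
per_review <- data.frame(
  ID        = as.integer(names(sentiment)),
  sentiment = as.numeric(sentiment),
  words     = as.integer(words)
)

# Drop reviews with fewer than 5 scored words, then rank by mean score.
ranked <- per_review[per_review$words >= 5, ]
ranked <- ranked[order(-ranked$sentiment), ]
print(ranked)
```

Review 2 has the highest mean but only two scored words, so the word-count filter removes it; this is the reason for requiring at least five sentiment words before trusting a review's average.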

GitHub source code:

https://github.com/susanli2016/Data-Analysis-with-R/blob/master/Text%20Mining%20Hilton%20Hawaiian%20Village%20TripAdvisor%20Reviews.Rmd

Related article:

https://towardsdatascience.com/scraping-tripadvisor-text-mining-and-sentiment-analysis-for-hotel-reviews-cc4e20aef333

[This is an original article by Big Data Digest, a 51CTO column contributor. WeChat official account: "大数据文摘 (id: BigDataDigest)"]


Editor: Zhao Ningning. Source: 51CTO column
