How Does Tomcat Handle Search Engine Crawler Requests?

Author: 侯樹成 2018-06-24 08:53:42

Every site on the Internet needs to be indexed by search engines and shown in their results at the right moment, so that its information reaches users and readers. So how does a search engine come to index our site?

This involves a search engine crawler fetching the site's content. Only content that a search engine has crawled and indexed has a chance of appearing in the results for a particular query.

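The tools that search engines use to fetch content are known as crawlers, Spiders, Web crawlers, and so on. On the one hand we welcome their visits, since that is how our content gets indexed; on the other hand, their impact on normal service is a headache. A Spider consumes server resources too, and when too many Spiders request too often, the handling of normal user requests suffers. Some sites therefore simply run a separate service for search engines to access, while normal user requests go to other servers.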

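It is worth mentioning here that a request is identified as coming from a Spider by the User-Agent field in the HTTP request headers, and each search engine has its own distinct identifier. Using the same field, administrators can also see in the access log which content search engines have crawled.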
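As a rough illustration of reading crawler activity out of an access log, here is a minimal sketch. It assumes log lines in the common "combined" format, where the User-Agent is the last quoted field; the class name `AccessLogCrawlerFilter`, the method `pathCrawledBy`, and the sample log line are hypothetical and not part of Tomcat.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AccessLogCrawlerFilter {
    // Assumed "combined" layout: "METHOD path protocol" status bytes "referer" "user-agent"
    private static final Pattern COMBINED = Pattern.compile(
            "\"\\S+ (\\S+) [^\"]*\" \\d{3} \\S+ \"[^\"]*\" \"([^\"]*)\"$");

    /**
     * Returns the requested path if the line's User-Agent contains uaToken,
     * or null if the line does not parse or belongs to another client.
     */
    static String pathCrawledBy(String logLine, String uaToken) {
        Matcher m = COMBINED.matcher(logLine);
        if (!m.find()) {
            return null;
        }
        String userAgent = m.group(2);
        return userAgent.contains(uaToken) ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample line made up for this example
        String line = "66.249.66.1 - - [24/Jun/2018:08:53:42 +0800] "
                + "\"GET /article/1 HTTP/1.1\" 200 1234 \"-\" "
                + "\"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)\"";
        System.out.println(pathCrawledBy(line, "Googlebot"));
    }
}
```

Running the same check over every line of the log with different tokens gives a per-crawler list of visited paths.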

In addition, robots.txt, the file that declares crawling rules to search engines, contains similar User-agent entries. For example, below is taobao's robots.txt:

User-agent:  Baiduspider 
Allow:  /article 
Allow:  /oshtml 
Disallow:  /product/ 
Disallow:  / 
 
User-Agent:  Googlebot 
Allow:  /article 
Allow:  /oshtml 
Allow:  /product 
Allow:  /spu 
Allow:  /dianpu 
Allow:  /oversea 
Allow:  /list 
Disallow:  / 
 
User-agent:  Bingbot 
Allow:  /article 
Allow:  /oshtml 
Allow:  /product 
Allow:  /spu 
Allow:  /dianpu 
Allow:  /oversea 
Allow:  /list 
Disallow:  / 
 
User-Agent:  360Spider 
Allow:  /article 
Allow:  /oshtml 
Disallow:  / 
 
User-Agent:  Yisouspider 
Allow:  /article 
Allow:  /oshtml 
Disallow:  / 
 
User-Agent:  Sogouspider 
Allow:  /article 
Allow:  /oshtml 
Allow:  /product 
Disallow:  / 
 
User-Agent:  Yahoo!  Slurp 
Allow:  /product 
Allow:  /spu 
Allow:  /dianpu 
Allow:  /oversea 
Allow:  /list 
Disallow:  /
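To make the grouping in this file concrete, here is a minimal sketch of how such rules could be evaluated. It is a simplification under stated assumptions: one User-agent line per group, plain path prefixes without wildcards, and the longest matching prefix wins, with Allow winning a tie. The class `RobotsRules` and its methods are hypothetical names for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class RobotsRules {
    /** One Allow or Disallow line. */
    static final class Rule {
        final boolean allow;
        final String prefix;
        Rule(boolean allow, String prefix) {
            this.allow = allow;
            this.prefix = prefix;
        }
    }

    // Lower-cased user-agent name -> rules of that group
    private final Map<String, List<Rule>> groups = new HashMap<>();

    static RobotsRules parse(String text) {
        RobotsRules result = new RobotsRules();
        List<Rule> current = null;
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (line.isEmpty() || line.startsWith("#") || colon < 0) {
                continue;
            }
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (key.equals("user-agent")) {
                // A User-agent line starts (or reopens) a group
                current = result.groups.computeIfAbsent(
                        value.toLowerCase(Locale.ROOT), k -> new ArrayList<>());
            } else if (current != null && !value.isEmpty()
                    && (key.equals("allow") || key.equals("disallow"))) {
                current.add(new Rule(key.equals("allow"), value));
            }
        }
        return result;
    }

    /** Longest matching prefix decides; no group or no match means allowed. */
    boolean isAllowed(String agent, String path) {
        List<Rule> rules = groups.get(agent.toLowerCase(Locale.ROOT));
        if (rules == null) {
            return true;
        }
        Rule best = null;
        for (Rule r : rules) {
            if (!path.startsWith(r.prefix)) {
                continue;
            }
            if (best == null || r.prefix.length() > best.prefix.length()
                    || (r.prefix.length() == best.prefix.length() && r.allow)) {
                best = r;
            }
        }
        return best == null || best.allow;
    }
}
```

With the Baiduspider group above, `/article/1` would be allowed, while `/product/1` and any other path fall under a Disallow rule.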

Now let's look at what special handling Tomcat applies to search engine requests.

It comes down to Sessions. A Session is what lets the server identify a specific user. When a Spider sends a large volume of frequent requests, a huge number of Sessions has to be created, which consumes a lot of memory and quietly takes resources away from normal user requests.

To address this, Tomcat provides a Valve that applies some handling to Spider requests.

It first identifies Spider requests, then has them carry the same SessionId through the rest of the request processing, which avoids creating a large amount of Session data.

Note that if a Spider explicitly sends a sessionId that does not correspond to a valid Session, it is discarded; the Valve looks the Session up by client IP instead, so the same Spider is given only one Session.

Let's look at the code:

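// Per-request state, declared at the top of the Valve's invoke(request, response) method
boolean isBot = false;
String sessionId = null;
String clientIp = null;

// If the incoming request has a valid session ID, no action is required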
if (request.getSession(false) == null) { 
 
    // Is this a crawler - check the UA headers 
    Enumeration<String> uaHeaders = request.getHeaders("user-agent"); 
    String uaHeader = null; 
    if (uaHeaders.hasMoreElements()) { 
        uaHeader = uaHeaders.nextElement(); 
    } 
 
    // If more than one UA header - assume not a bot 
    if (uaHeader != null && !uaHeaders.hasMoreElements()) { 
        if (uaPattern.matcher(uaHeader).matches()) { 
            isBot = true; 
            if (log.isDebugEnabled()) { 
                log.debug(request.hashCode() + 
                        ": Bot found. UserAgent=" + uaHeader); 
            } 
        } 
    } 
 
    // If this is a bot, is the session ID known? 
    if (isBot) { 
        clientIp = request.getRemoteAddr(); 
        sessionId = clientIpSessionId.get(clientIp); 
        if (sessionId != null) { 
            request.setRequestedSessionId(sessionId); // reuse the existing session
        } 
    } 
} 
 
getNext().invoke(request, response); 
 
if (isBot) { 
    if (sessionId == null) { 
        // Has bot just created a session, if so make a note of it 
        HttpSession s = request.getSession(false); 
        if (s != null) { 
            clientIpSessionId.put(clientIp, s.getId()); // remember the session created for this Spider
            sessionIdClientIp.put(s.getId(), clientIp); 
            // #valueUnbound() will be called on session expiration 
            s.setAttribute(this.getClass().getName(), this); 
            s.setMaxInactiveInterval(sessionInactiveInterval); 
 
            if (log.isDebugEnabled()) { 
                log.debug(request.hashCode() + 
                        ": New bot session. SessionID=" + s.getId()); 
            } 
        } 
    } else { 
        if (log.isDebugEnabled()) { 
            log.debug(request.hashCode() + 
                    ": Bot session accessed. SessionID=" + sessionId); 
        } 
    } 
}
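The effect of the two maps can be seen in isolation with a small stand-alone sketch. It does not use any Tomcat classes: `BotSessionRegistry`, its methods, and the `bot-session-N` IDs are hypothetical stand-ins for the Valve's bookkeeping and for real Session creation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class BotSessionRegistry {
    // Same idea as the Valve's clientIpSessionId map: one session ID per client IP
    private final Map<String, String> clientIpSessionId = new ConcurrentHashMap<>();
    private final AtomicInteger created = new AtomicInteger();

    /** Returns the session ID for a bot at clientIp, creating one only on first sight. */
    String sessionIdFor(String clientIp) {
        return clientIpSessionId.computeIfAbsent(
                clientIp, ip -> "bot-session-" + created.incrementAndGet());
    }

    /** Forgets the mapping, as would happen when the session expires. */
    void sessionExpired(String clientIp) {
        clientIpSessionId.remove(clientIp);
    }

    int sessionsCreated() {
        return created.get();
    }

    public static void main(String[] args) {
        BotSessionRegistry registry = new BotSessionRegistry();
        for (int i = 0; i < 1000; i++) {
            registry.sessionIdFor("66.249.66.1"); // the same Spider, many requests
        }
        System.out.println(registry.sessionsCreated()); // prints 1
    }
}
```

However many requests arrive from the same IP, only one session is created until that session expires.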

A Spider is recognized with a regular expression:

private String crawlerUserAgents = 
    ".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*"; 
// compiled when the Valve is initialized
uaPattern = Pattern.compile(crawlerUserAgents);

This way, when a Spider arrives, it is recognized by its User-agent and handled specially, which reduces its impact.

The Valve is named CrawlerSessionManagerValve, a good name whose purpose is clear at a glance.

Are there any remaining problems? Let's look at how Sessions are shared based on ClientIp.

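Tomcat recently shipped a bug fix for this. With identification by ClientIp alone, when the Valve is configured at the Engine level and shared by multiple Hosts, it takes effect for only one Host. After the fix, a request is identified not only by ClientIp but also by Host and Context; together these make up the client identifier, so Sessions can be shared more effectively.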
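The idea behind the fix can be sketched as follows. This is not the actual Tomcat patch: the class `HostAwareBotSessions`, the key layout produced by `clientKey`, and the session ID format are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class HostAwareBotSessions {
    private final Map<String, String> sessionIdByClient = new HashMap<>();
    private int counter = 0;

    /** Builds a client identifier from IP, host name and context path (hypothetical layout). */
    static String clientKey(String clientIp, String hostName, String contextPath) {
        return clientIp + '-' + hostName + contextPath;
    }

    /** One session per (IP, Host, Context) combination instead of one per IP. */
    String sessionIdFor(String clientIp, String hostName, String contextPath) {
        String key = clientKey(clientIp, hostName, contextPath);
        return sessionIdByClient.computeIfAbsent(key, k -> "bot-session-" + (++counter));
    }

    public static void main(String[] args) {
        HostAwareBotSessions sessions = new HostAwareBotSessions();
        String a = sessions.sessionIdFor("66.249.66.1", "a.example.com", "/shop");
        String b = sessions.sessionIdFor("66.249.66.1", "b.example.com", "/shop");
        System.out.println(a.equals(b)); // prints false
    }
}
```

With an IP-only key, the second Host would be handed a session ID that belongs to the first Host; the composite key keeps a separate shared Session for each Host and Context.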

The change is as follows: [figure: diff of the fix]

To summarize: once the Valve recognizes a Spider request by its identifier, it assigns the Spider a fixed Session, which avoids the resource consumption caused by creating large numbers of Sessions.

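The Valve is not enabled by default; you enable it by adding configuration to server.xml. Also, if you compare the regex pattern above with taobao's robots.txt, you will find that it does not cover the Chinese search engines listed there. What can be done about that?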

Just pass your own expression in when configuring the Valve; the property has a public setter:

public void setCrawlerUserAgents(String crawlerUserAgents) { 
    this.crawlerUserAgents = crawlerUserAgents; 
    if (crawlerUserAgents == null || crawlerUserAgents.length() == 0) { 
        uaPattern = null; 
    } else { 
        uaPattern = Pattern.compile(crawlerUserAgents); 
    } 
}
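Before putting a custom expression into the configuration, it can be checked on its own. The sketch below compares the default expression quoted earlier with a hypothetical extended one that also matches anything containing "spider" or "Spider"; the extended expression and the sample User-Agent strings are assumptions for illustration, and real crawler headers may differ.

```java
import java.util.regex.Pattern;

class ExtendedCrawlerPattern {
    // Default expression as quoted in the article
    static final String DEFAULT =
            ".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*";

    // Hypothetical extension covering names such as Baiduspider, 360Spider, Yisouspider
    static final String EXTENDED = DEFAULT + "|.*[sS]pider.*";

    /** Same whole-string matching that the Valve applies to the User-Agent header. */
    static boolean matches(String regex, String userAgent) {
        return Pattern.compile(regex).matcher(userAgent).matches();
    }

    public static void main(String[] args) {
        String baidu = "Mozilla/5.0 (compatible; Baiduspider/2.0; "
                + "+http://www.baidu.com/search/spider.html)";
        System.out.println(matches(DEFAULT, baidu));  // prints false
        System.out.println(matches(EXTENDED, baidu)); // prints true
    }
}
```

An expression that passes this check can then be supplied as the Valve's `crawlerUserAgents` value, which ends up in the setter shown above.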

[This article is an original contribution by 51CTO column author 侯樹成. To republish it, please obtain permission through the author's WeChat official account 『Tomcat那些事兒』.]

Editor: 趙寧寧 Source: 51CTO column