Writing a Web Crawler Step by Step (4): Getting Started with Scrapy

Author: Anonymous 2018-05-16 13:50:30



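In this series:

"Writing a Web Crawler Step by Step (1): NetEase Cloud Music Playlists"
"Writing a Web Crawler Step by Step (2): A Mini Crawler Architecture"
"Writing a Web Crawler Step by Step (3): Comparing Open-Source Crawler Frameworks"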

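Last time we took a rational look at why you should learn Scrapy. There is only one reason: it is free and won't cost you a cent!

Huh? Why is someone throwing tomatoes? Fine, I admit I watch too much TV. There is no TV for me tonight, though: to meet the deadline, this is yet another sleepless night... Back to business. This installment covers the basics of Scrapy. If you want to dig deeper, refer to the official documentation, which is famously thorough, so I won't use up this account's space repeating it.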


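Architecture Overview

Below is Scrapy's architecture, including its components and an overview of the data flow through the system (shown by the red arrows). Each component is then introduced briefly, along with a short description of the data flow.

That is the architecture. The flow is much the same as the mini architecture I described in part two, but it is far more extensible.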

One more thing


scrapy startproject tutorial

This command creates a tutorial directory with the following contents:

tutorial/
    scrapy.cfg            # the project's configuration file
    tutorial/             # the project's Python module; you will add your code here
        __init__.py
        items.py          # the project's items file
        pipelines.py      # the project's pipelines file
        settings.py       # the project's settings file
        spiders/          # directory where spider code goes
            __init__.py

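Writing Your First Spider

A Spider is a class you write to scrape data from a single website (or a group of websites). It contains the initial URLs to download, plus methods that define how to follow links in the pages and how to parse page content.

Below is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory: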

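import scrapy


class QuotesSpider(scrapy.Spider):
    # Name used to launch this spider: "scrapy crawl quotes"
    name = "quotes"

    def start_requests(self):
        # Initial requests; each response is handed to self.parse
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The URL ends with ".../page/<n>/", so [-2] is the page number
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        # response.body is bytes, hence the binary write mode
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)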
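
The parse callback derives each output filename from the response URL. As a minimal standalone sketch of just that step, the snippet below uses plain string operations and does not need Scrapy. The helper name filename_for is hypothetical and introduced here only for illustration.

```python
def filename_for(url):
    """Build the filename that the spider's parse() uses for a page URL."""
    # 'http://quotes.toscrape.com/page/1/'.split("/") ends with
    # [..., 'page', '1', ''] because of the trailing slash, so the
    # page number is the second-to-last element.
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


if __name__ == "__main__":
    for url in ('http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/'):
        print(filename_for(url))
```

Note that this indexing relies on the trailing slash: for a URL ending in /page/1 with no final slash, the second-to-last element would be 'page' rather than the page number.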

Running Our Spider

Go to the project's root directory and run the following command to start the spider:

scrapy crawl quotes

This command starts the spider that crawls quotes.toscrape.com. You will get output similar to this:

2017-05-10 20:36:17 [scrapy.core.engine] INFO: Spider opened  
2017-05-10 20:36:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)  
2017-05-10 20:36:17 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023  
2017-05-10 20:36:17 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)  
2017-05-10 20:36:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)  
2017-05-10 20:36:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)  
2017-05-10 20:36:17 [quotes] DEBUG: Saved file quotes-1.html  
2017-05-10 20:36:17 [quotes] DEBUG: Saved file quotes-2.html  
2017-05-10 20:36:17 [scrapy.core.engine] INFO: Closing spider (finished)

Extracting Data

So far we have only saved the HTML pages without extracting any data. Let's upgrade the code and add extraction. How to analyze a page with the browser's developer tools was covered earlier in this series.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # With start_urls set, Scrapy builds the initial requests itself
    # and calls parse() on each response by default
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # Each quote on the page lives in a <div class="quote"> element
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

Run the spider again and you will see the extracted data in the log:

2017-05-10 20:38:33 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>  
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}  
2017-05-10 20:38:33 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>  
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

Saving the Scraped Data

The simplest way to store the scraped data is to use Feed exports:

scrapy crawl quotes -o quotes.json

This command serializes the scraped data as JSON and writes it to the file quotes.json.
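
Once the feed has been written, it can be loaded with Python's standard json module. The sketch below assumes the export is a JSON list of dicts with the 'text', 'author' and 'tags' keys yielded by the spider. The function name quotes_per_author and the sample records are hypothetical stand-ins, used so the snippet runs without a crawl.

```python
import json
from collections import Counter


def quotes_per_author(json_text):
    """Count scraped quotes per author from feed-export JSON text."""
    items = json.loads(json_text)  # assumed shape: a list of item dicts
    return Counter(item['author'] for item in items)


if __name__ == "__main__":
    # Illustrative stand-in for the contents of quotes.json;
    # these records are made up for the demo, not real crawl output.
    sample = json.dumps([
        {'text': 'first', 'author': 'André Gide', 'tags': ['life']},
        {'text': 'second', 'author': 'Thomas A. Edison', 'tags': []},
        {'text': 'third', 'author': 'Thomas A. Edison', 'tags': []},
    ])
    for author, count in quotes_per_author(sample).most_common():
        print(author, count)
```

To run it against a real export, read the file with open('quotes.json', encoding='utf-8') and pass its contents to the function.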

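For a small project like the one in this tutorial, this kind of storage is enough. If you need to do more complex processing on the scraped items, you can write an Item Pipeline. The file tutorial/pipelines.py was already created for you when the project was generated.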
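
As a rough sketch of what an Item Pipeline could look like, the class below strips the curly quotation marks that surround each scraped 'text' value. The class name StripQuoteMarksPipeline is hypothetical. The process_item(self, item, spider) method follows the interface described in Scrapy's Item Pipeline documentation, and a pipeline is a plain Python class, so the snippet runs without importing Scrapy.

```python
class StripQuoteMarksPipeline:
    """Hypothetical pipeline: remove curly quote marks around 'text'."""

    def process_item(self, item, spider):
        # Scrapy calls process_item() for every item the spider yields;
        # the returned item goes on to the next pipeline or the exporter.
        text = item.get('text')
        if text:
            # \u201c and \u201d are the left and right curly double quotes
            item['text'] = text.strip('\u201c\u201d')
        return item


if __name__ == "__main__":
    # Stand-alone demo; inside a crawl, Scrapy supplies the spider object.
    pipeline = StripQuoteMarksPipeline()
    item = {'text': '\u201cI have not failed.\u201d',
            'author': 'Thomas A. Edison'}
    print(pipeline.process_item(item, spider=None)['text'])
```

For Scrapy to use such a class, it would be placed in tutorial/pipelines.py and enabled through the ITEM_PIPELINES setting in settings.py, for example ITEM_PIPELINES = {'tutorial.pipelines.StripQuoteMarksPipeline': 300}.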

Editor: Pang Guiyu. Source: Python開發(fā)者 (Python Developers)
