In Depth: Everything You Need to Know About BeautifulSoup

Author: 派森醬 2021-10-05 21:03:54

A Detailed Look at Scraping with BeautifulSoup

The first article in this series covered the basics of scraping with BeautifulSoup: how to install it, what it is, and a quick tour of locating tags and reading their contents. This article goes deeper into BeautifulSoup's syntax and usage.

1. BeautifulSoup objects

BeautifulSoup turns a complex HTML document into a tree in which every node is a Python object. The official documentation groups these objects into four kinds:

Tag
NavigableString
BeautifulSoup
Comment

Each of the four is described below.

Tag

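A Tag object represents a tag in an XML or HTML document. Put simply, it is one of the individual tags in the HTML, and it corresponds to the tag in the original document. Tag has many methods and attributes. In BeautifulSoup a tag is reached as soup.tagname, where tagname is the name of an HTML tag such as a or title, and the result is the complete tag, including its attributes and content. The following are examples of tags:

<title>BeautifulSoup in Detail</title>
<p class="title">Hello</p>
<p class="con">Python Technology</p>

In the HTML above, title and p are tags. A start tag, an end tag, and the content between them together make up a Tag. A tag can be retrieved like this: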

from bs4 import BeautifulSoup

# Build a soup object from a local file
with open('test.html', 'rb') as f:
    soup = BeautifulSoup(f, "html.parser")
# Get the first a tag in the document
a = soup.a  # Tag
print('The a tag is:', a)
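As a minimal sketch of the same idea without a file on disk, the snippet below parses a short inline string. The markup is a made-up stand-in for test.html, and the variable names are illustrative only. It shows that soup.tagname returns only the first matching tag, and None when the tag does not exist.

```python
from bs4 import BeautifulSoup

# Hypothetical inline markup standing in for test.html
html = '<p class="title">Hello</p><p class="con">Python</p><a href="https://www.baidu.com">ddd</a>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p      # only the first <p> is returned
missing = soup.table  # there is no <table>, so this is None

print(first_p)                 # <p class="title">Hello</p>
print(type(first_p).__name__)  # Tag
print(missing)                 # None
```

Checking for None before using the result avoids an AttributeError when a page lacks the expected tag.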

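Beyond that, the two most important attributes of a Tag are name and attrs.

name

The name attribute returns the tag's name in the document tree. To get the name of the title tag, use soup.title.name. For any tag inside the document, the value is simply the tag's own name.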
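A short sketch of reading name, assuming a tiny inline document rather than test.html. It also shows that assigning to name renames the tag in the tree, which the BeautifulSoup documentation describes as a supported operation.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document used only for this illustration
soup = BeautifulSoup("<html><head><title>Demo</title></head></html>", "html.parser")

title = soup.title
name_before = title.name
print(name_before)        # title
print(title.parent.name)  # head

# Assigning to .name renames the tag in the tree
title.name = "h1"
print(title)              # <h1>Demo</h1>
```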

attrs

attrs is short for attributes, and attributes are an important part of a web page's tags. A tag may have many attributes, for example:

<a href="https://www.baidu.com" class="xiaodu" id="l1">ddd</a>

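The example above has three attributes: href, class with the value "xiaodu", and id with the value "l1". Tag attributes are accessed the same way as the keys of a Python dictionary. The code below gets all attributes of the p tag as a dictionary. Because soup.p refers to the first p in the document, it returns the attributes and values of the first paragraph.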

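# Get all attributes of the first p tag as a dictionary
print(soup.p.attrs)
# {'class': ['title']}

# Get a single attribute value from the a tag
print(soup.a['class'])
# ['xiaodu']
print(soup.a.get('class'))
# ['xiaodu']
print(soup.a.get('id'))
# l1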

Each tag may have many attributes, all available through .attrs, and a tag's attributes can be modified, deleted, or added.
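Modifying, adding, and deleting attributes can be sketched as follows. The markup repeats the a tag shown earlier, while the new values ("l2", "search") are invented for the example.

```python
from bs4 import BeautifulSoup

html = '<a href="https://www.baidu.com" class="xiaodu" id="l1">ddd</a>'
soup = BeautifulSoup(html, "html.parser")
a = soup.a

a["id"] = "l2"         # modify an existing attribute
a["title"] = "search"  # add a new attribute (illustrative value)
del a["class"]         # delete an attribute

print(a.attrs)
print(a.get("class"))  # None, because the attribute no longer exists
```

Using get() instead of square brackets is the safer choice when an attribute may be absent, since a["class"] would raise KeyError after the deletion.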

NavigableString

A NavigableString is a navigable string. Strings usually sit inside tags, and BeautifulSoup wraps each such string in the NavigableString class. A NavigableString behaves like a Python Unicode string (str in Python 3) and also supports some of the features described under navigating and searching the tree. The code below prints the type of a NavigableString.

# coding=utf-8 
from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('test.html','rb'), "html.parser") 
tag = soup.title 
print(type(tag.string))

The output is:

<class 'bs4.element.NavigableString'>
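To illustrate how a NavigableString is both a string and a node in the tree, here is a small sketch on an invented one-line document. It relies only on the fact that NavigableString subclasses str.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
soup = BeautifulSoup("<title>Demo page</title>", "html.parser")
text = soup.title.string

print(isinstance(text, str))  # True: ordinary string methods work
print(text.upper())           # DEMO PAGE
print(text.parent.name)       # title: it still knows where it sits in the tree

# str() gives a plain string that no longer references the parse tree
plain = str(text)
print(type(plain).__name__)   # str
```

Converting with str() is worthwhile when the text is kept after parsing, because a NavigableString holds a reference to the whole tree.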

BeautifulSoup

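The BeautifulSoup object represents the document as a whole. Most of the time it can be treated as a Tag object, and it supports most of the methods described under navigating and searching the tree. The code below prints the type of the soup object, which is the BeautifulSoup class.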

# coding=utf-8 
from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('test.html','rb'), "html.parser") 
tag = soup.title 
 
print(type(soup))

The output is:

<class 'bs4.BeautifulSoup'>

Because the BeautifulSoup object is not a real HTML or XML tag, it has no name or attributes of its own. It is still sometimes convenient to look at its .name, so the BeautifulSoup object is given the special value [document] for soup.name. Printing soup.name therefore outputs [document].
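A minimal sketch of checking this, using an invented one-line document:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
soup = BeautifulSoup("<title>Demo</title>", "html.parser")

print(soup.name)        # [document]
print(soup.title.name)  # title
```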

Comment

A Comment object is a special type of NavigableString used to handle comments. The sample below reads the content of a comment:

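from bs4 import BeautifulSoup


def mark():
    markup = "<b><!-- hello comment code --></b>"
    soup = BeautifulSoup(markup, "html.parser")
    print(type(soup))
    # The only child of <b> is the comment, so .string returns it
    comment = soup.b.string
    print(type(comment))
    print(comment)


if __name__ == '__main__':
    mark()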

The output is:

<class 'bs4.BeautifulSoup'> 
<class 'bs4.element.Comment'> 
 hello comment code
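Because a Comment is also a NavigableString, scraped text can accidentally include comments. One possible way to tell them apart is an isinstance check, sketched below. The second tag and the results list are additions for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# The <i> tag is added here to contrast a comment with ordinary text
markup = "<b><!-- hello comment code --></b><i>plain text</i>"
soup = BeautifulSoup(markup, "html.parser")

results = []
for tag in (soup.b, soup.i):
    s = tag.string
    # Record the tag name, whether its string is a comment, and the text
    results.append((tag.name, isinstance(s, Comment), str(s)))

for row in results:
    print(row)
```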

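2. Navigating the tree

With the four objects covered, this part turns to navigating the tree, searching the tree, and commonly used BeautifulSoup functions. In BeautifulSoup, a Tag may contain multiple strings or other tags; these are called the tag's children.

We continue with the following HTML document (the test.html opened by the code samples):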

<!DOCTYPE html> 
<html lang="en"> 
<head> 
    <title>BeautifulSoup in Detail</title>
</head>
<body>
<p class="title">Hello</p>
<p class="con">Python Technology</p>
 
<a href="https://www.baidu.com" class="xiaodu" id="l1">ddd</a> 
 
</body> 
</html>

Child nodes

A Tag may contain multiple strings or other Tags; all of these are the Tag's child nodes. Beautiful Soup provides many attributes for working with and iterating over child nodes.

For example, to get the child nodes of a tag:

# coding=utf-8 
from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('test.html','rb'), "html.parser") 
tag = soup.title 
 
print(soup.head.contents)

The output is:

['\n', <title>BeautifulSoup in Detail</title>, '\n']

Note: string nodes in Beautiful Soup do not support these attributes, because a string has no children.
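The child list includes whitespace strings as well as tags, which is easy to overlook. A small sketch, assuming a trimmed-down head element rather than the full test.html, that uses .contents and .children and then keeps only the Tag children:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment mirroring the <head> of the sample document
html = "<head>\n<title>Demo</title>\n</head>"
soup = BeautifulSoup(html, "html.parser")
head = soup.head

# .contents is a list; the newlines around <title> are children too
print(len(head.contents))  # 3

# .children iterates over the same nodes; filter out the strings
tags_only = [child for child in head.children if isinstance(child, Tag)]
print([t.name for t in tags_only])  # ['title']
```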

Node content

If a tag has only one child node and you need that child's content, use the string attribute. When a tag has more than one child, it is unclear which one string should refer to, so the value is None:

# coding=utf-8 
from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('test.html','rb'), "html.parser") 
tag = soup.title 
 
print(soup.head.string) 
 
print(soup.title.string)

The output is:

None
BeautifulSoup in Detail
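When string is None because a tag has several children, the text can still be collected with stripped_strings or get_text(). The sketch below assumes a simplified head element and is one possible approach, not the only one.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <head> has three children ('\n', <title>, '\n')
html = "<head>\n<title>Demo</title>\n</head>"
soup = BeautifulSoup(html, "html.parser")
head = soup.head

print(head.string)                  # None
print(list(head.stripped_strings))  # ['Demo']
print(head.get_text(strip=True))    # Demo
```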

Parent node

Use the parent attribute to reach a node's parent; to get the parent's tag name, use parent.name. For example:

# coding=utf-8 
from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('test.html','rb'), "html.parser") 
tag = soup.title 
 
p = soup.p 
print(p.parent) 
print(p.parent.name) 
 
content = soup.head.title.string 
print(content.parent) 
print(content.parent.name)

The output is:

<body> 
<p class="title">Hello</p> 
<p class="con">Python Technology</p>
<a class="xiaodu" href="https://www.baidu.com" id="l1">ddd</a>
</body>
body
<title>BeautifulSoup in Detail</title>
title
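Beyond a single parent, the parents attribute walks all the way up to the top of the tree. A minimal sketch on an invented document:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = "<html><body><p class='title'>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .parents yields every ancestor, ending with the BeautifulSoup object
chain = [parent.name for parent in soup.p.parents]
print(chain)        # ['body', 'html', '[document]']

# The BeautifulSoup object itself has no parent
print(soup.parent)  # None
```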

Sibling nodes

Siblings are nodes on the same level as the current node. next_sibling returns the node's next sibling and previous_sibling returns the previous one; if no such node exists, the value is None.

print(soup.p.next_sibling)
print(soup.p.previous_sibling)
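In a document with line breaks between tags, the immediate sibling is usually a whitespace string rather than the next tag. The sketch below, on an invented fragment, shows this and uses find_next_sibling() and find_previous_sibling() to skip to the nearest sibling tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with newlines between the tags
html = '<body>\n<p class="title">Hello</p>\n<p class="con">Python</p>\n</body>'
soup = BeautifulSoup(html, "html.parser")
first = soup.p

# The direct sibling is the newline between the two paragraphs
print(repr(str(first.next_sibling)))      # '\n'

# find_next_sibling() skips strings and returns the next matching tag
print(first.find_next_sibling("p"))       # <p class="con">Python</p>

# There is no <p> before the first one
print(first.find_previous_sibling("p"))   # None
```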

Next and previous elements

The next_element attribute returns the next node in parse order and previous_element returns the previous one, for example:

print(soup.p.next_element) 
print(soup.p.previous_element)
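The difference from siblings is that next_element follows the order in which nodes were parsed, so it descends into a tag before moving on. A small sketch on an invented fragment without whitespace between tags:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; parse order is: p, "Hello", p, "Python"
html = '<p class="title">Hello</p><p class="con">Python</p>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("p")

# The element parsed right after the first <p> is its own text
print(str(first.next_element))      # Hello

# The next sibling is the second <p>, on the same level
print(first.next_sibling)           # <p class="con">Python</p>

# The element parsed right before the second <p> is the text "Hello"
print(str(second.previous_element)) # Hello
```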

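3. Searching the tree

BeautifulSoup defines many search methods, such as find() and find_all(), of which find_all() is the most commonly used. The other search methods mirror tree navigation and cover parents, children, siblings, and so on. find_all() is used like this: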

# coding=utf-8 
from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('test.html','rb'), "html.parser") 
tag = soup.title 
 
urls = soup.find_all('p') 
for u in urls: 
    print(u)

The output is:

<p class="title">Hello</p>
<p class="con">Python Technology</p>

find_all() returns every tag in the document that matches the filter, while find() returns only the first match.
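A few common filter forms can be sketched as follows. The markup is a shortened, hypothetical version of the sample document, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup based on the body of the sample document
html = '''<p class="title">Hello</p>
<p class="con">Python</p>
<a href="https://www.baidu.com" class="xiaodu" id="l1">ddd</a>'''
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")                  # first match only
con_ps = soup.find_all("p", class_="con") # filter by CSS class
link = soup.find(id="l1")                 # filter by id attribute
no_table = soup.find("table")             # None when nothing matches
no_tables = soup.find_all("table")        # empty list when nothing matches

print(first_p.string)  # Hello
print(len(con_ps))     # 1
print(link["href"])    # https://www.baidu.com
print(no_table)        # None
print(no_tables)       # []
```

Note the asymmetry: find() signals "nothing found" with None, while find_all() returns an empty list, so only the former needs a None check before further attribute access.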

Summary

That covers the BeautifulSoup basics and usage as far as I understand them. Please bear with any mistakes, and let's keep improving together.

Editor: 武曉燕 Source: Python技术
