Downloading Audiobooks with Python Multithreading
Experienced (unmarried) engineers rent an apartment near the office, sparing themselves the commute and saving a great deal of time. Others, for one reason or another, trek between home and work every day under the stars, and unfortunately I am one of them. Home and office are far apart, so my time on the bus eats up a quarter of my working hours, and Hangzhou has long been jokingly called China's Las Vegas (a Chinese pun: 賭城 "gambling city" sounds like 堵城 "gridlock city"); whenever traffic grinds to a halt, I can picture myself turning into a Transformer. For a programmer all that dead time is hard to bear, but since I cannot change my situation in the short term, I might as well put the time to good use. So I bought a big-screen Note II for reading PDFs, and my ears should not sit idle either; rather than English practice, I listen to novels. Back in school I loved the radio, especially storytelling and crosstalk, so I need a steady supply of audio novels. There are plenty of these online, but downloading them is a chore: to squeeze out more traffic and ad clicks, these sites make you open at least two pages before you reach the real download link. To cut down the overall download time, I wrote this small program so that I (and you) can download audio novels conveniently (and, really, any other kind of resource).
To be clear, I am not out to scrape large amounts of data; this is purely for fun and for learning, so the program does not crawl every link on a site aimlessly. Instead, you give it a single novel. Say I want to download the novel Childhood (《童年》): I find its index page on 我聽評(píng)書網(wǎng) (5tps.com) and let the program download all of its mp3 files. The details are in the code below; everything lives in the module crawler5tps:
1. First, set the start url and the directory to save files into
```python
# -*- coding: GBK -*-
import urllib, urllib2
import re, threading, os

baseurl = 'http://www.5tps.com'   # base url
down2path = 'E:/enovel/'          # saving path
save2path = ''                    # saving file name (full path)
```
2. Parse the download-page urls out of the start url
```python
def parseUrl(starturl):
    '''
    Parse the download-page links out of the start url.
    e.g. from 'http://www.5tps.com/html/8297.html' we can get
    'http://www.5tps.com/down/8297_52_1_1.html'
    '''
    global save2path
    rDownloadUrl = re.compile(".*?<A href=\'(/down/\w+\.html)\'.*")  # link of a download page
    # sample title line: <TITLE>有聲小說(shuō) 悶騷1 播音:劉濤 全集</TITLE>
    #rTitle = re.compile("<TITLE>.{4}\s{1}(.*)\s{1}.*</TITLE>")
    f = urllib2.urlopen(starturl)
    totalLine = f.readlines()
    # build the saving directory from the page title
    title = totalLine[3].split(" ")[1]
    if not os.path.exists(down2path + title):
        os.mkdir(down2path + title)
    save2path = down2path + title + "/"
    downUrlLine = [line for line in totalLine if rDownloadUrl.match(line)]
    downLoadUrl = []
    for dl in downUrlLine:
        # one line may carry several links; strip each match out
        # and rescan until none are left
        while True:
            m = rDownloadUrl.match(dl)
            if not m:
                break
            downUrl = m.group(1)
            downLoadUrl.append(downUrl.strip())
            dl = dl.replace(downUrl, '')
    return downLoadUrl
```
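The inner while loop exists because `re.match` only returns the first match on a line, so the code strips each hit out and rescans. As a side note, `re.findall` collects every non-overlapping match in one call; a minimal Python 3 sketch against a made-up index-page line:

```python
import re

# A hypothetical index-page line carrying two download links
line = "<A href='/down/8297_52_1_1.html'>1</A><A href='/down/8297_52_1_2.html'>2</A>"

# findall returns every non-overlapping capture at once, so no
# match/strip/rescan loop is needed
links = re.findall(r"<A href='(/down/\w+\.html)'", line)
print(links)  # ['/down/8297_52_1_1.html', '/down/8297_52_1_2.html']
```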
3. Parse the real download link out of each download page
```python
def getDownlaodLink(starturl):
    '''
    Find the real download link on each download page.
    e.g. from 'http://www.5tps.com/down/8297_52_1_1.html' we can get
    'http://180j-d.ysts8.com:8000/人物紀(jì)實(shí)/童年/001.mp3?\
    1251746750178x1356330062x1251747362932-3492f04cf54428055a110a176297d95a'
    '''
    downUrl = []
    gbk_ClickWord = '點(diǎn)此下載'  # "click here to download" (GBK-encoded anchor text)
    downloadUrl = parseUrl(starturl)
    # the real link sits in an <a> whose anchor text is the click word
    rDownUrl = re.compile('<a href=\"(.*)\"><font color=\"blue\">' + gbk_ClickWord + '.*</a>')
    for url in downloadUrl:
        realurl = baseurl + url
        print realurl
        for line in urllib2.urlopen(realurl).readlines():
            m = rDownUrl.match(line)
            if m:
                downUrl.append(m.group(1))
    return downUrl
```
4. Define the download function
```python
def download(url, filename):
    ''' download a single mp3 file '''
    print url
    urllib.urlretrieve(url, filename)
```
5. Create a thread class for the downloads
```python
class DownloadThread(threading.Thread):
    ''' download thread class '''
    def __init__(self, func, savePath):
        threading.Thread.__init__(self)
        self.function = func   # despite the name, this is the url to fetch
        self.savePath = savePath
    def run(self):
        download(self.function, self.savePath)
```
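The subclass above only forwards two values into `download`; `threading.Thread` can do that directly through `target` and `args`, with no subclass at all. A Python 3 sketch using a stand-in download function (`fake_download` is invented here, not part of the article's code):

```python
import threading

results = []

def fake_download(url, savePath):
    # stand-in for the real download(): just record what was requested
    results.append((url, savePath))

t = threading.Thread(target=fake_download,
                     args=('http://example.com/001.mp3', '1.mp3'))
t.start()
t.join()   # wait for the worker to finish
print(results)  # [('http://example.com/001.mp3', '1.mp3')]
```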
6. Start downloading
```python
if __name__ == '__main__':
    starturl = 'http://www.5tps.com/html/8297.html'
    downUrl = getDownlaodLink(starturl)
    aliveThreadDict = {}       # maps running thread -> url index
    downloadingUrlDict = {}    # url indices currently being downloaded
    i = 0
    # Note: 5tps (我聽評(píng)書網(wǎng)) allows at most three simultaneous downloads
    # of the same novel, but the connection is flaky, so to make sure we
    # get real mp3 files the thread count is capped at 2
    while i < len(downUrl) or downloadingUrlDict:
        while len(downloadingUrlDict) < 2 and i < len(downUrl):
            downloadingUrlDict[i] = i
            i += 1
        for urlIndex in downloadingUrlDict.values():
            if urlIndex not in aliveThreadDict.values():
                t = DownloadThread(downUrl[urlIndex], save2path + str(urlIndex + 1) + '.mp3')
                t.start()
                aliveThreadDict[t] = urlIndex
        for (th, urlIndex) in aliveThreadDict.items():
            if not th.isAlive():
                del aliveThreadDict[th]           # free the thread slot
                del downloadingUrlDict[urlIndex]  # this url is finished
    print 'Completed Download Work'
```
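The hand-rolled scheduling in step 6 (a cap of two live threads plus polling for finished ones) is exactly what Python 3's `concurrent.futures.ThreadPoolExecutor` provides out of the box: `max_workers=2` enforces the same cap, and the pool queues the rest. A sketch with invented urls and a stand-in download function:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_download(url, filename):
    # stand-in for the real download(url, filename)
    return (url, filename)

urls = ['u1', 'u2', 'u3', 'u4']  # hypothetical mp3 links
with ThreadPoolExecutor(max_workers=2) as pool:
    # at most two downloads run at once; the pool queues the rest
    futures = [pool.submit(fake_download, u, '%d.mp3' % (i + 1))
               for i, u in enumerate(urls)]
    done = [f.result() for f in futures]
print(done)
```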
And that's it; let it download to its heart's content while I go code other projects, sigh >>>
After work I'll copy everything onto the Note and listen to novels while reading. Source code attached above.
Original post: http://www.cnblogs.com/wuren/archive/2012/12/24/2831100.html