Downloading Audiobooks with Multithreaded Python
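Experienced (and unmarried) veterans rent a place near the office, which spares them the commute and saves a great deal of time. Others, for one reason or another, shuttle between home and office under the stars every day, and unfortunately I am one of them. My home is far from the office, so the time I spend on the bus comes to about a quarter of my working hours. On top of that, Hangzhou has long been called China's Las Vegas (a Chinese pun: "gambling city" sounds like "traffic-jam city"), and whenever the jam sets in I can picture myself turning into a Transformer. I doubt any programmer can bear to waste that much time, but since I cannot change my living situation any time soon, I might as well put the time to good use. So I bought a large-screen Note II for reading PDFs, and my ears should not sit idle either, though I listen to novels rather than English lessons. Back in school I loved listening to the radio, especially storytelling (pingshu) and crosstalk (xiangsheng), so I need a lot of audiobooks. There are plenty of them online, but downloading them is extremely tedious: to earn more traffic and ad clicks, these sites make you open at least two pages before you reach the real link. To cut the total download time I wrote this small program, so that I and everyone else can download audiobooks (and, of course, any other kind of resource) more easily.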
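A note first: I am not trying to crawl large amounts of data; this is only for fun and learning. So the program does not blindly crawl every link on a site. Instead it works on one given novel. For example, to download the novel 《童年》 (Childhood), I find its home page on 5tps.com (我聽評書網) and let the program download all of its mp3 files. The details are in the code below; all of it lives in the module crawler5tps: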
1. Set the base URL and the directory for saved files

    #-*-coding:GBK-*-
    import urllib, urllib2
    import re, threading, os

    baseurl = 'http://www.5tps.com'   # base url
    down2path = 'E:/enovel/'          # root directory for all downloads
    save2path = ''                    # directory of the current novel, filled in by parseUrl
 
2. Parse the download-page URLs from the start URL

    def parseUrl(starturl):
        '''
        parse out download page from start url.
        eg. we can get 'http://www.5tps.com/down/8297_52_1_1.html' from 'http://www.5tps.com/html/8297.html'
        '''
        global save2path
        rDownloadUrl = re.compile(r".*?<A href='(/down/\w+\.html)'.*")  # find the link of download page
        #rTitle = re.compile("<TITLE>.{4}\s{1}(.*)\s{1}.*</TITLE>")
        #<TITLE>有聲小說 悶騷1 播音:劉濤 全集</TITLE>
        f = urllib2.urlopen(starturl)
        totalLine = f.readlines()
        # create the saving directory: the <TITLE> tag sits on the 4th line of the
        # page, and the novel's name is its second space-separated field
        title = totalLine[3].split(" ")[1]
        if not os.path.exists(down2path+title):
            os.makedirs(down2path+title)   # also creates down2path if it is missing
        save2path = down2path+title+"/"
        downUrlLine = [line for line in totalLine if rDownloadUrl.match(line)]
        downLoadUrl = []
        for dl in downUrlLine:
            # one line may hold several links: take the first one, remove it
            # from the line, and match again until nothing is left
            while True:
                m = rDownloadUrl.match(dl)
                if not m:
                    break
                downUrl = m.group(1)
                downLoadUrl.append(downUrl.strip())
                dl = dl.replace(downUrl, '')
        return downLoadUrl
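
The match-and-replace loop above exists to pull several links out of a single line. As a rough Python 3 sketch (the article's own code is Python 2), `re.findall` over the whole page can do the same job in one pass. The name `find_download_pages` and the sample HTML are made up for illustration, and the real page markup may differ.

```python
import re

# Same idea as rDownloadUrl: a single-quoted href pointing at /down/<name>.html
DOWN_LINK = re.compile(r"<A href='(/down/\w+\.html)'")

def find_download_pages(html):
    """Return every /down/*.html path in the page, in order, without duplicates."""
    seen = []
    for path in DOWN_LINK.findall(html):
        if path not in seen:
            seen.append(path)
    return seen

# Hypothetical snippet shaped like the links the article describes.
sample = ("<li><A href='/down/8297_52_1_1.html'>001</A>"
          "<A href='/down/8297_52_1_2.html'>002</A></li>")
print(find_download_pages(sample))
```

Dropping duplicates here is a design choice for the sketch; it mirrors the way the original loop removes every occurrence of a link from the line once it has been collected.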
 
3. Parse the real download link from each download page

    def getDownlaodLink(starturl):
        '''
        find out the real download link from download page.
        eg. we can get the download link
        'http://180j-d.ysts8.com:8000/人物紀實/童年/001.mp3?1251746750178x1356330062x1251747362932-3492f04cf54428055a110a176297d95a'
        from 'http://www.5tps.com/down/8297_52_1_1.html'
        '''
        downUrl = []
        gbk_ClickWord = '點此下載'   # link text on the page ("click here to download")
        downloadUrl = parseUrl(starturl)
        rDownUrl = re.compile('<a href="(.*)"><font color="blue">' + gbk_ClickWord + '.*</a>')  # find the real download link
        for url in downloadUrl:
            realurl = baseurl+url
            print realurl
            for line in urllib2.urlopen(realurl).readlines():
                m = rDownUrl.match(line)   # match() only succeeds at the start of the line
                if m:
                    downUrl.append(m.group(1))
        return downUrl
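
One detail of the step above: `re.match` only succeeds at the start of a string, so the link is found only when the `<a href=...>` tag is the first thing on its line. The following is a hedged Python 3 sketch that uses `re.search` instead, which finds the tag anywhere in the text. The name `find_real_link` and the sample markup are assumptions for illustration and are not taken from the site.

```python
import re

def find_real_link(html, click_word):
    """Return the href of the first blue link whose text starts with click_word, else None."""
    pattern = re.compile(
        r'<a href="([^"]*)"><font color="blue">' + re.escape(click_word)
    )
    m = pattern.search(html)
    return m.group(1) if m else None

# Hypothetical line; the query string is a placeholder, not a real token.
line = ('  <td><a href="http://example.com/001.mp3?token">'
        '<font color="blue">點此下載</font></a></td>')
print(find_real_link(line, '點此下載'))
```

Using `[^"]*` rather than `.*` for the href keeps the capture from running past the closing quote when a line contains more than one attribute.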
 
4. Define the download function

    def download(url, filename):
        ''' download mp3 file '''
        print url
        urllib.urlretrieve(url, filename)   # fetch url and write it to filename
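
In Python 3 the same call lives at `urllib.request.urlretrieve`. Below is a minimal sketch, under the assumption that skipping files already on disk is desirable when a run is restarted; the name `download_once` is hypothetical and not part of the article's module.

```python
import os
import urllib.request

def download_once(url, filename):
    """Fetch url into filename unless a non-empty file is already there.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(filename) and os.path.getsize(filename) > 0:
        return False
    urllib.request.urlretrieve(url, filename)
    return True
```

A call such as `download_once(url, save2path + '1.mp3')` would then be safe to repeat after an interrupted run.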
 
5. Create the thread class used for downloading files

    class DownloadThread(threading.Thread):
        ''' download thread class '''
        def __init__(self, func, savePath):
            threading.Thread.__init__(self)
            self.function = func       # despite its name, this holds the download url
            self.savePath = savePath   # full path of the target file
        def run(self):
            download(self.function, self.savePath)
 
6. Start downloading

    if __name__ == '__main__':
        starturl = 'http://www.5tps.com/html/8297.html'
        downUrl = getDownlaodLink(starturl)
        aliveThreadDict = {}      # alive thread -> url index
        downloadingUrlDict = {}   # indexes of the links being downloaded
        i = 0
        while i < len(downUrl):
            # Note: 5tps.com allows at most three threads to download the same
            # novel at once; because network conditions sometimes interfere, the
            # thread count is set to 2 to make sure real mp3 files are downloaded
            while len(downloadingUrlDict) < 2 and i < len(downUrl):   # never run past the last link
                downloadingUrlDict[i] = i
                i += 1
            for urlIndex in downloadingUrlDict.values():
                if urlIndex not in aliveThreadDict.values():
                    t = DownloadThread(downUrl[urlIndex], save2path+str(urlIndex+1)+'.mp3')
                    t.start()
                    aliveThreadDict[t] = urlIndex
            for (th, urlIndex) in aliveThreadDict.items():
                if not th.is_alive():
                    del aliveThreadDict[th]            # delete the thread slot
                    del downloadingUrlDict[urlIndex]   # delete the url from url list needed to download
        for th in aliveThreadDict.keys():   # wait for the last downloads to finish
            th.join()
        print 'Completed Download Work'
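
The bookkeeping above caps the number of simultaneous downloads at two by hand. In Python 3 the standard library's `concurrent.futures.ThreadPoolExecutor` gives the same cap with less code. This is a sketch, not the article's method: `download_all` and its `fetch` parameter are hypothetical names, and `fetch` stands in for any function taking `(url, filename)`, such as the `download` function from step 4.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def download_all(urls, save_dir, fetch, max_workers=2):
    """Run fetch(url, path) for every url with at most max_workers threads.

    Files are named 1.mp3, 2.mp3, ... in the order of urls, as in the article.
    Returns the list of target paths; an error raised by fetch is re-raised here.
    """
    paths = [os.path.join(save_dir, "%d.mp3" % (i + 1)) for i in range(len(urls))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the result iterator so worker exceptions surface here
        list(pool.map(fetch, urls, paths))
    return paths

if __name__ == "__main__":
    # Stand-in fetch that only prints, so the sketch runs without network access.
    def fake_fetch(url, path):
        print("would download %s -> %s" % (url, path))

    download_all(["http://example.com/a", "http://example.com/b"], "enovel", fake_fetch)
```

Because the executor is used as a context manager, the function returns only after every download has finished, so a completion message printed afterwards is accurate.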
 
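That is all there is to it. Let it download to its heart's content; I still have other projects to code. Sigh.

After work, copy the files to the Note and you can listen to a novel while reading documents. The source code is attached.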
Original article: http://www.cnblogs.com/wuren/archive/2012/12/24/2831100.html