Common MapReduce Data Mining Algorithms: Mean and Variance
Mean and variance with map-reduce
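The formulas for the mean and variance of a set of numbers are well known, but how should the map and reduce functions be designed? Start from the formulas. Suppose there are n numbers a1, a2, ..., an. Then the mean is m = (a1 + a2 + ... + an) / n and the variance is S = [(a1-m)^2 + (a2-m)^2 + ... + (an-m)^2] / n.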
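Expanding the variance formula gives S = [(a1^2 + ... + an^2) + m^2*n - 2*m*(a1 + a2 + ... + an)] / n. Based on this, the map side takes (key, a1) as input and emits (1, (n1, sum1, var1)), where n1 is the count of numbers processed by one worker, sum1 is the sum of those numbers (e.g. a1 + a2 + a3 + ...), and var1 is the sum of their squares (e.g. a1^2 + a2^2 + ...). Every mapper emits the same key, 1, so all partial results reach a single reducer.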
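For readers who want the intermediate step, the expansion can be written out as below. This is only a worked restatement of the formula above in LaTeX notation; the last line uses the fact that the sum of the a_i equals n*m, and shows that the variance also equals the mean of the squares minus the square of the mean.

```latex
\begin{aligned}
S &= \frac{1}{n}\sum_{i=1}^{n}(a_i - m)^2 \\
  &= \frac{1}{n}\left[\sum_{i=1}^{n} a_i^2 \;-\; 2m\sum_{i=1}^{n} a_i \;+\; n m^2\right] \\
  &= \frac{1}{n}\sum_{i=1}^{n} a_i^2 \;-\; m^2
  \qquad\text{since } \sum_{i=1}^{n} a_i = n m .
\end{aligned}
```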
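After receiving these tuples, the reduce side adds up all the counts n1, n2, ... to get n and all the sums sum1, sum2, ... to get sum, so the mean is m = sum/n. It then adds up var1, var2, ... to get var, so the overall variance is S = (var + m^2*n - 2*m*sum)/n. The reducer outputs (1, (m, S)).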
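The map and reduce steps described above can be tried out without any cluster. The following is a minimal plain-Python sketch, not the mrjob implementation: the function names `map_partition` and `reduce_partials` and the sample data are made up for illustration, and each list in `chunks` stands in for the input of one worker.

```python
from typing import Iterable, List, Tuple

# One partial result per worker: (n_i, sum_i, var_i), where var_i is the
# sum of squares, following the naming used in the text.
Partial = Tuple[int, float, float]


def map_partition(values: Iterable[float]) -> Partial:
    """Map side: accumulate count, sum, and sum of squares for one worker."""
    n, total, sq_total = 0, 0.0, 0.0
    for v in values:
        n += 1
        total += v
        sq_total += v * v
    return n, total, sq_total


def reduce_partials(partials: Iterable[Partial]) -> Tuple[float, float]:
    """Reduce side: merge all partials, then apply the expanded formula."""
    n, total, sq_total = 0, 0.0, 0.0
    for n_i, sum_i, var_i in partials:
        n += n_i
        total += sum_i
        sq_total += var_i
    mean = total / n
    variance = (sq_total + mean * mean * n - 2 * mean * total) / n
    return mean, variance


# Hypothetical sample data, split unevenly across three "workers".
data: List[float] = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
chunks = [data[:2], data[2:5], data[5:]]
mean, variance = reduce_partials(map_partition(c) for c in chunks)
print(mean, variance)
```

Because only counts and sums are merged, the result does not depend on how the data is split among workers; it matches the population variance computed directly from the definition.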
The code below is an mrjob-based implementation (https://pythonhosted.org/mrjob/; Machine Learning in Action, Chapter 15).
    from mrjob.job import MRJob
    from mrjob.step import MRStep


    class MRmean(MRJob):
        def __init__(self, *args, **kwargs):
            super(MRmean, self).__init__(*args, **kwargs)
            # Running totals for the lines seen by this mapper
            self.inCount = 0
            self.inSum = 0
            self.inSqSum = 0

        def map(self, key, val):  # needs exactly 2 arguments
            if False: yield  # emits nothing, but makes map a generator
            inVal = float(val)
            self.inCount += 1
            self.inSum += inVal  # running sum of the elements
            self.inSqSum += inVal*inVal  # running sum of the squared elements

        def map_final(self):
            if self.inCount == 0:  # this mapper received no input
                return
            # Map output. Note that it carries means rather than raw sums:
            # mn = sum1/n1 and mnSq = var1/n1
            mn = self.inSum/self.inCount
            mnSq = self.inSqSum/self.inCount
            yield (1, [self.inCount, mn, mnSq])

        def reduce(self, key, packedValues):
            cumVal = 0.0; cumSumSq = 0.0; cumN = 0.0
            for valArr in packedValues:  # parse the map-side output
                nj = float(valArr[0])
                cumN += nj
                cumVal += nj*float(valArr[1])  # nj * mn recovers sum1
                cumSumSq += nj*float(valArr[2])  # nj * mnSq recovers var1
            mean = cumVal/cumN
            var = (cumSumSq - 2*mean*cumVal + cumN*mean*mean)/cumN
            yield (mean, var)  # reduce output: mean and variance

        def steps(self):
            # MRStep is used here in place of the self.mr(...) helper from
            # the original listing, which later mrjob releases deprecated.
            return [MRStep(mapper=self.map, mapper_final=self.map_final,
                           reducer=self.reduce)]


    if __name__ == '__main__':
        MRmean.run()