A Brief Look at Grouping and Aggregating with LINQ
Let's start by generating the data to aggregate:

    // Generates n (key, value) pairs. The key never decreases, so the result is sorted by key.
    IEnumerable<Tuple<int, double>> GetTuples(int n)
    {
        var tuples = new Tuple<int, double>[n];
        var rand = new Random();
        for (int k = 1, i = 0; i < n; i++)
        {
            var r = rand.Next(n);
            // Occasionally advance the key: by 2 for the top 3 values of r, by 1 for the next 6.
            k += (r >= n - 3) ? 2 : ((r >= n - 9) ? 1 : 0);
            tuples[i] = new Tuple<int, double>(k, rand.NextDouble());
        }
        return tuples;
    }
 
This method generates n items that are already sorted by key.
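Because the key only ever grows inside the loop, the keys come out in non-decreasing order. A minimal sketch of a check for that property follows. The names `SortedCheck` and `IsSortedByKey` are hypothetical and not part of the original program, and the BCL's `System.Tuple` is assumed here in place of the article's own Tuple class.

```csharp
using System;
using System.Collections.Generic;

static class SortedCheck
{
    // Hypothetical helper: returns true when the keys (Item1) never decrease.
    public static bool IsSortedByKey(IEnumerable<Tuple<int, double>> tuples)
    {
        int? previous = null;
        foreach (var t in tuples)
        {
            // A key smaller than its predecessor breaks the ordering.
            if (previous != null && t.Item1 < previous.Value) return false;
            previous = t.Item1;
        }
        return true;
    }
}
```

Calling `SortedCheck.IsSortedByKey(GetTuples(1000))` should return true under these assumptions, since the generator never lowers the key.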
Now let's group the items by key and compute the count and the average of each group.
First, using a C# foreach loop:
    // Single pass over key-sorted input; emits (key, count, average) for each run of equal keys.
    IEnumerable<Tuple<int, int, double>> ForEach(IEnumerable<Tuple<int, double>> tuples)
    {
        var result = new List<Tuple<int, int, double>>();
        var count = 0;
        var sum = 0.0;
        int? key = null;  // null until the first item has been seen
        foreach (var v in tuples)
        {
            if (key != v.Item1)
            {
                // The key changed: flush the finished group, then start a new one.
                if (key != null) result.Add(new Tuple<int, int, double>(key.Value, count, sum / count));
                sum = count = 0;
                key = v.Item1;
            }
            count++;
            sum += v.Item2;
        }
        // The last group is still pending when the loop ends.
        if (key != null) result.Add(new Tuple<int, int, double>(key.Value, count, sum / count));
        return result;
    }
 
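The biggest drawback of this approach is that the last group has to be flushed once more after the foreach loop ends, so the result.Add line is duplicated. That is a code smell.
So let's refactor, this time driving the loop with the enumerator directly: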
    IEnumerable<Tuple<int, int, double>> Iterate(IEnumerable<Tuple<int, double>> tuples)
    {
        var result = new List<Tuple<int, int, double>>();
        var count = 0;
        var sum = 0.0;
        int? key = null;
        // Dispose the enumerator when done, as foreach would.
        using (var iter = tuples.GetEnumerator())
        {
            // The increment clause accumulates the current item at the end of each pass.
            for (; ; count++, sum += iter.Current.Item2)
            {
                var hasValue = iter.MoveNext();
                // Flush on a key change or at the end of the input, so there is a single Add.
                if (!hasValue || key != iter.Current.Item1)
                {
                    if (key != null) result.Add(new Tuple<int, int, double>(key.Value, count, sum / count));
                    if (!hasValue) break;
                    sum = count = 0;
                    key = iter.Current.Item1;
                }
            }
        }
        return result;
    }
 
This removes the code smell.
Note that both methods above assume the input is already sorted by key. If it is not, the input must be sorted first.
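The effect of that assumption can be sketched as follows. `GroupAdjacent` repeats the single-pass logic of the ForEach method, and `SortThenGroup` is a hypothetical wrapper that orders the input by key with `OrderBy` from System.Linq before grouping. Both names, and the use of the BCL's `System.Tuple`, are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SortBeforeGrouping
{
    // Same single-pass idea as the ForEach method: only adjacent equal keys are merged.
    public static List<Tuple<int, int, double>> GroupAdjacent(IEnumerable<Tuple<int, double>> tuples)
    {
        var result = new List<Tuple<int, int, double>>();
        var count = 0;
        var sum = 0.0;
        int? key = null;
        foreach (var v in tuples)
        {
            if (key != v.Item1)
            {
                if (key != null) result.Add(Tuple.Create(key.Value, count, sum / count));
                count = 0;
                sum = 0.0;
                key = v.Item1;
            }
            count++;
            sum += v.Item2;
        }
        if (key != null) result.Add(Tuple.Create(key.Value, count, sum / count));
        return result;
    }

    // Hypothetical wrapper: sorts by key first so that equal keys become adjacent.
    public static List<Tuple<int, int, double>> SortThenGroup(IEnumerable<Tuple<int, double>> tuples)
    {
        return GroupAdjacent(tuples.OrderBy(t => t.Item1));
    }
}
```

With the unsorted keys 2, 1, 2, `GroupAdjacent` reports three groups because the two runs of key 2 are not adjacent, while `SortThenGroup` reports two. The sort costs O(n log n) on top of the linear grouping pass.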
Finally, with LINQ the code becomes even simpler:
    IEnumerable<Tuple<int, int, double>> Linq(IEnumerable<Tuple<int, double>> tuples)
    {
        var result = new List<Tuple<int, int, double>>();
        // Group by key; unlike the two loops above, this does not rely on the input being sorted.
        var q = from k in tuples group k by k.Item1;
        foreach (var g in q) result.Add(new Tuple<int, int, double>(g.Key, g.Count(), g.Average(v => v.Item2)));
        return result;
    }
 
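Note that the LINQ version costs more than the two loops in both running time and memory. The grouping has to buffer every element into per-key groups before any result is produced, whereas the loops keep only a running count and sum.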
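One rough way to compare the three methods is sketched below. The `Measure` class and its `Run` method are hypothetical helpers, not part of the original program. They rely on `Stopwatch` and `GC.GetTotalMemory`, which give only an approximate picture: the memory figure is the change in the managed heap after the call, not the peak usage during it.

```csharp
using System;
using System.Diagnostics;

static class Measure
{
    // Hypothetical helper: runs the action once and returns
    // (elapsed milliseconds, approximate change in managed heap bytes).
    public static Tuple<long, long> Run(Action action)
    {
        // Force a collection first so earlier garbage does not distort the baseline.
        long before = GC.GetTotalMemory(true);
        var watch = Stopwatch.StartNew();
        action();
        watch.Stop();
        long after = GC.GetTotalMemory(false);
        return Tuple.Create(watch.ElapsedMilliseconds, after - before);
    }
}
```

Inside the program, a call such as `Measure.Run(() => Linq(tuples))` could then be compared with `Measure.Run(() => ForEach(tuples))`. A single run is noisy, so repeating each measurement several times would give a fairer comparison.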
Here is the Main method:
    static void Main(string[] args)
    {
        try
        {
            // args[0] is the data size in units of 1024 * 1024 items.
            new Program().Run(Console.Out, int.Parse(args[0]));
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex);
        }
    }

    // Runs all three grouping methods on the same data so their output can be compared.
    void Run(TextWriter writer, int n)
    {
        var tuples = GetTuples(n * 1024 * 1024);
        Write("ForEach", writer, ForEach(tuples));
        Write("Iterate", writer, Iterate(tuples));
        Write(" Linq ", writer, Linq(tuples));
    }
 
The Write method used above is as follows:
    // Prints one row per group, then a summary row: group count, total count, overall average.
    void Write(string title, TextWriter writer, IEnumerable<Tuple<int, int, double>> tuples)
    {
        writer.WriteLine("==========> " + title + " <============");
        writer.WriteLine("Key ------Count Average----------");
        var count = 0;
        var sum = 0.0;
        foreach (var t in tuples)
        {
            writer.WriteLine("{0,3} {1,11:N0} {2}", t.Item1, t.Item2, t.Item3);
            count += t.Item2;
            // count * average recovers the group's sum, so sum / count below is the overall mean.
            sum += t.Item2 * t.Item3;
        }
        writer.WriteLine("--- ----------- -----------------");
        writer.WriteLine("{0,3} {1,11:N0} {2}", tuples.Count(), count, sum / count);
        writer.WriteLine();
    }
 
Finally, the output of the program is shown below:
==========> ForEach <============
Key ------Count Average----------
  1      10,476 0.492122426354162
  2   1,633,289 0.499917991099794
  3     981,345 0.500446307804579
  5   1,542,377 0.500567888024527
  6     478,158 0.499376479287702
  8      62,325 0.501552373474687
  9   1,463,104 0.500270067230854
 11     802,680 0.500518684820775
 13     367,798 0.499572390413821
 14     492,947 0.500767958524
 16   2,403,053 0.500023199420802
 17     248,208 0.499988049057847
--- ----------- -----------------
 12  10,485,760 0.50018897689056
==========> Iterate <============
Key ------Count Average----------
  1      10,476 0.492122426354162
  2   1,633,289 0.499917991099794
  3     981,345 0.500446307804579
  5   1,542,377 0.500567888024527
  6     478,158 0.499376479287702
  8      62,325 0.501552373474687
  9   1,463,104 0.500270067230854
 11     802,680 0.500518684820775
 13     367,798 0.499572390413821
 14     492,947 0.500767958524
 16   2,403,053 0.500023199420802
 17     248,208 0.499988049057847
--- ----------- -----------------
 12  10,485,760 0.50018897689056
==========>  Linq   <============
Key ------Count Average----------
  1      10,476 0.492122426354162
  2   1,633,289 0.499917991099794
  3     981,345 0.500446307804579
  5   1,542,377 0.500567888024527
  6     478,158 0.499376479287702
  8      62,325 0.501552373474687
  9   1,463,104 0.500270067230854
 11     802,680 0.500518684820775
 13     367,798 0.499572390413821
 14     492,947 0.500767958524
 16   2,403,053 0.500023199420802
 17     248,208 0.499988049057847
--- ----------- -----------------
 12  10,485,760 0.50018897689056
 
The Tuple classes used in this program are defined as follows:
    // Immutable pair: (key, value).
    class Tuple<T1, T2>
    {
        public T1 Item1 { get; private set; }
        public T2 Item2 { get; private set; }
        public Tuple(T1 item1, T2 item2) { Item1 = item1; Item2 = item2; }
    }

    // Immutable triple: extends the pair with a third item, used here for (key, count, average).
    class Tuple<T1, T2, T3> : Tuple<T1, T2>
    {
        public T3 Item3 { get; private set; }
        public Tuple(T1 item1, T2 item2, T3 item3) : base(item1, item2) { Item3 = item3; }
    }
 
In fact, the .NET Framework 4.0 Base Class Library already includes Tuple classes.
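On .NET Framework 4.0 or later, the custom classes could therefore be dropped in favour of `System.Tuple`. The sketch below shows how one result row might be built with the `Tuple.Create` factory. `BuiltInTupleDemo` and `MakeRow` are hypothetical names chosen for illustration.

```csharp
using System;

static class BuiltInTupleDemo
{
    // Hypothetical helper: builds one (key, count, average) row with the BCL factory method,
    // which infers the type arguments so they need not be spelled out.
    public static Tuple<int, int, double> MakeRow(int key, int count, double sum)
    {
        return Tuple.Create(key, count, sum / count);
    }
}
```

One difference to keep in mind: in the BCL, `Tuple<T1, T2, T3>` does not derive from `Tuple<T1, T2>`, so code that relies on the inheritance in the custom classes would need adjusting. The built-in tuples also compare by value, so `MakeRow(3, 4, 2.0).Equals(Tuple.Create(3, 4, 0.5))` is true.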
The complete source code for this article can be downloaded here.