
How to count how often each element occurs in a sequence

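Examples:
1. Given a random sequence, find the 3 elements that occur most often. How many times does each occur?
2. Given the words of an English article, run a word-frequency count and find the 10 most frequent words. How many times does each occur?
step1: Build the random sequence with a list comprehension.
step2: The result should be a dictionary, so create one with dict.fromkeys, keyed by the elements of the sequence with every value set to 0.
step3: Iterate over the sequence and add 1 to the dictionary value of each element encountered.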

In [1]: from random import randint

In [2]: data = [randint(0,20) for _ in range(30)]

In [3]: data
Out[3]: 
[0,
 0,
 17,
 5,
 5,
 10,
 3,
 17,
 20,
 13,
 14,
 17,
 16,
 17,
 13,
 8,
 6,
 14,
 1,
 18,
 2,
 5,
 6,
 10,
 20,
 12,
 7,
 7,
 5,
 10]

In [4]: c = dict.fromkeys(data,0)

In [5]: c
Out[5]: 
{0: 0,
 1: 0,
 2: 0,
 3: 0,
 5: 0,
 6: 0,
 7: 0,
 8: 0,
 10: 0,
 12: 0,
 13: 0,
 14: 0,
 16: 0,
 17: 0,
 18: 0,
 20: 0}

In [6]: for x in data:
   ...:     c[x] += 1
   ...:     

In [7]: c
Out[7]: 
{0: 2,
 1: 1,
 2: 1,
 3: 1,
 5: 4,
 6: 2,
 7: 2,
 8: 1,
 10: 3,
 12: 1,
 13: 2,
 14: 2,
 16: 1,
 17: 4,
 18: 1,
 20: 2}
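The manual dictionary holds the counts, but it does not yet answer the "top 3" question. A minimal sketch of one way to finish the job by hand, assuming the same data as in Out[3] above (the names counts and top3 are illustrative and not from the original session):

```python
# The 30 values shown in Out[3] above, hard-coded so the result is reproducible.
data = [0, 0, 17, 5, 5, 10, 3, 17, 20, 13, 14, 17, 16, 17, 13,
        8, 6, 14, 1, 18, 2, 5, 6, 10, 20, 12, 7, 7, 5, 10]

# step2: one key per distinct element, every count starting at 0.
counts = dict.fromkeys(data, 0)

# step3: walk the sequence and bump the count of each element seen.
for x in data:
    counts[x] += 1

# Sort the (element, count) pairs by count, highest first, and keep 3.
top3 = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:3]
print(top3)
```

The elements 5 and 17 both occur 4 times, so their relative order in top3 depends on how ties fall out of the sort. For a long sequence where only a few top entries are needed, heapq.nlargest(3, counts.items(), key=...) avoids sorting every pair.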

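Solution:
Use a collections.Counter object: pass the sequence to the Counter constructor, and the resulting Counter is a dictionary that maps each element to its frequency.
Word-frequency count for an English article:
Use a regular expression to split the article on runs of non-word characters (\W matches anything other than a letter, digit, or underscore):
re.split(r'\W+', txt)
The Counter.most_common(n) method returns a list of the n most frequent elements, each paired with its count.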

In [8]: from collections import Counter

In [9]: c2 = Counter(data)

In [10]: c
Out[10]: 
{0: 2,
 1: 1,
 2: 1,
 3: 1,
 5: 4,
 6: 2,
 7: 2,
 8: 1,
 10: 3,
 12: 1,
 13: 2,
 14: 2,
 16: 1,
 17: 4,
 18: 1,
 20: 2}

In [11]: c2
Out[11]: Counter({5: 4, 17: 4, 10: 3, 0: 2, 6: 2, 7: 2, 13: 2, 14: 2, 20: 2, 1: 1, 2: 1, 3: 1, 8: 1, 12: 1, 16: 1, 18: 1})

In [12]: c2.most_common(3)
Out[12]: [(5, 4), (17, 4), (10, 3)]

In [13]: c2.most_common(10)
Out[13]: 
[(5, 4),
 (17, 4),
 (10, 3),
 (0, 2),
 (6, 2),
 (7, 2),
 (13, 2),
 (14, 2),
 (20, 2),
 (1, 1)]

In [14]: import re

In [15]: txt = open('test.txt').read()

In [16]: c3 = Counter(re.split(r'\W+',txt))

In [17]: c3

In [18]: c3.most_common(10)
Out[18]: 
[('00', 1023),
 ('0', 764),
 ('p', 563),
 ('fd', 513),
 ('so', 434),
 ('00000000', 418),
 ('usr', 387),
 ('lib64', 382),
 ('r', 297),
 ('1', 284)]
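Because \W+ does not split on digits or underscores, tokens such as '00' and '0' are counted as words in the output above. re.split can also return empty strings when the text begins or ends with a separator. The following is a rough sketch of a word counter that drops empty tokens and ignores case. It assumes an in-memory sample string instead of test.txt, and the helper name top_words is hypothetical:

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most common words in text as (word, count) pairs."""
    # Lowercase first so that 'The' and 'the' are counted together,
    # then split on runs of non-word characters.
    words = re.split(r'\W+', text.lower())
    # re.split yields '' when the text starts or ends with a separator,
    # so skip empty tokens before counting.
    return Counter(w for w in words if w).most_common(n)


# Illustrative sample text, not taken from the original test.txt.
sample = "The quick brown fox. The lazy dog! The end."
print(top_words(sample, 1))
```

To count only alphabetic words, one option is to collect tokens with re.findall(r'[A-Za-z]+', text) instead of splitting on \W+.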