使用defaultdict对从输入字符串中每个位置开始的每个子字符串进行计数。OP尚不清楚是否应该包含重叠的匹配项,这种蛮力方法包括它们。from collections import defaultdictdef getsubs(loc, s): substr = s[loc:] i = -1 while(substr): yield substr substr = s[loc:i] i -= 1def longestRepetitiveSubstring(r, minocc=3): occ = defaultdict(int) # tally all occurrences of all substrings for i in range(len(r)): for sub in getsubs(i,r): occ[sub] += 1 # filter out all substrings with fewer than minocc occurrences occ_minocc = [k for k,v in occ.items() if v >= minocc] if occ_minocc: maxkey = max(occ_minocc, key=len) return maxkey, occ[maxkey] else: raise ValueError("no repetitions of any substring of '%s' with %d or more occurrences" % (r,minocc))印刷品:('helloworld', 3)
让我们从头开始,计算频率,并在出现最频繁的元素3次或更多次后立即停止。from collections import Countera='fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'times=3for n in range(1,len(a)/times+1)[::-1]: substrings=[a[i:i+n] for i in range(len(a)-n+1)] freqs=Counter(substrings) if freqs.most_common(1)[0][1]>=3: seq=freqs.most_common(1)[0][0] breakprint "sequence '%s' of length %s occurs %s or more times"%(seq,n,times)结果:>>> sequence 'helloworld' of length 10 occurs 3 or more times编辑:如果您感觉自己正在处理随机输入,并且公共子字符串的长度应该很小,那么最好以小子字符串开始(如果需要速度),而当找不到任何出现在该子字符串时停止至少3次:from collections import Countera='fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'times=3for n in range(1,len(a)/times+1): substrings=[a[i:i+n] for i in range(len(a)-n+1)] freqs=Counter(substrings) if freqs.most_common(1)[0][1]<3: n-=1 break else: seq=freqs.most_common(1)[0][0]print "sequence '%s' of length %s occurs %s or more times"%(seq,n,times) 与上述相同的结果。