Split a dictionary into a list of dictionaries based on overlapping values

I have a dictionary of chromosome coordinates, as in the example below:


First_dict = {'Key1': ['chr10', 19010495, 19014590, 19014064],
              'Key2': ['chr10', 19010495, 19014658],
              'Key3': ['chr10', 19010502, 19014641],
              'Key4': ['chr10', 37375766, 37377526],
              'Key5': ['chr10', 76310389, 76315990, 76312224, 76312963],
              'Key6': ['chr11', 14806147, 14814006]}

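I want to create a list of dictionaries. Keys whose coordinate ranges (the minimum and maximum of each dictionary value) overlap by at least 1000 should be grouped into one new dictionary, and every remaining key should become its own dictionary in the new list.


So ideally, something like this: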


New_list = [{'Key1': ['chr10', 19010495, 19014590, 19014064],
             'Key2': ['chr10', 19010495, 19014658],
             'Key3': ['chr10', 19010502, 19014641]},
            {'Key4': ['chr10', 37375766, 37377526]},
            {'Key5': ['chr10', 76310389, 76315990, 76312224, 76312963]},
            {'Key6': ['chr11', 14806147, 14814006]}]

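Here Key1, Key2 and Key3 are grouped into one dictionary in New_list because their chromosome coordinates overlap, while Key4, Key5 and Key6 are each a separate dictionary in New_list because they do not overlap at all.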


My initial idea was to split First_dict into a list of dictionaries using


[{k: v} for (k, v) in First_dict.items()]

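and then loop over the dictionaries, comparing each one's minimum and maximum with those of the previous dictionary to check for overlap, and build a new list from that. But I ran into several problems and could not get it to work.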
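The pairwise check described above could be sketched as follows. This is only an illustration under one assumption: "overlap" is taken to mean the length of the stretch shared by two [min, max] spans on the same chromosome. The helper names span and overlap_length are hypothetical and do not come from the question.

```python
def span(value):
    """Return (chromosome, lowest coordinate, highest coordinate)."""
    chrom, *coords = value
    return chrom, min(coords), max(coords)


def overlap_length(a, b):
    """Length shared by two spans; 0 if they are on different chromosomes."""
    chrom_a, lo_a, hi_a = span(a)
    chrom_b, lo_b, hi_b = span(b)
    if chrom_a != chrom_b:
        return 0
    # A negative difference means the spans are disjoint, so clamp to 0.
    return max(0, min(hi_a, hi_b) - max(lo_a, lo_b))


key1 = ['chr10', 19010495, 19014590, 19014064]
key2 = ['chr10', 19010495, 19014658]
key4 = ['chr10', 37375766, 37377526]
key6 = ['chr11', 14806147, 14814006]

print(overlap_length(key1, key2))  # 4095: same chromosome, long shared stretch
print(overlap_length(key1, key4))  # 0: same chromosome, disjoint spans
print(overlap_length(key1, key6))  # 0: different chromosomes
```

Under this reading, Key1 and Key2 share 4095 positions, well above the 1000 threshold, which is why they would land in the same group.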


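I also looked at other questions about grouping dictionaries, such as: Grouping Python dictionary keys as a list and create a new dictionary with this list as a value.


But my problem is that my values are not always exactly the same, as the example above shows. I also have to take the chromosome into account when checking for overlap.


Can anyone help, or suggest something to try? Many thanks.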


慕容森
1 answer

MYYA

This problem might be better suited to a graph-based solution. Nothing prevents multiple ranges from overlapping at different intervals.

#!/usr/bin/env python3

from pprint import pprint


def mapper(d, overlap=1000):
    """Each chromosomal coordinate must be interrogated
    to determine if it is within +/-overlap of any other.

    Range within which any    Original dictionary     Coordinate
    other value will match    key and chromosome      from the list
    ------------------------  ----------------------  -------------
    ((el-overlap, el+overlap), (dict-key, chromosome), el)
    """
    for key, ch in d.items():
        for el in ch[1:]:
            yield ((el-overlap, el+overlap), (key, ch[0]), el)


def sorted_mapper(d, overlap=1000):
    """Simply sort the mapper data by its first element."""
    for r in sorted(mapper(d, overlap), key=lambda x: x[0]):
        yield r


def groups(iter_):
    # Chain records whose coordinate falls inside the range of the
    # first record of the current group.
    previous = next(iter_)
    retval = [previous]
    for chrm in iter_:
        if previous[0][0] <= chrm[-1] <= previous[0][1]:
            retval.append(chrm)
        else:
            yield retval
            previous = chrm
            retval = [previous]
    yield retval


def reduce_phase1(iter_):
    # Drop the engineered range; keep {key: [chromosome, coordinates...]}.
    for l in iter_:
        retval = {}
        for (minc, maxc), (key, lbl), chrm in l:
            x = retval.get(key, [lbl])
            x.append(chrm)
            retval[key] = x
        yield retval


def update_dict(d1, d2):
    retval = d1
    for key, value in d2.items():
        if key in d1.keys():
            retval[key].extend(value[1:])
    return retval


def reduce_phase2(iter_):
    # Merge each partial dictionary into an earlier one that already
    # holds all of its keys; otherwise start a new group.
    retval = [next(iter_)]
    retval_keys = [set(retval[0].keys())]
    for d in iter_:
        keyset = set(d.keys())
        isnew = True
        for i, e in enumerate(retval_keys):
            if keyset <= e:
                isnew = False
                retval[i] = update_dict(retval[i], d)
        if isnew:
            retval.append(d)
            retval_keys.append(keyset)
    return retval


First_dict = {'Key1': ['chr10', 19010495, 19014590, 19014064],
              'Key2': ['chr10', 19010495, 19014658],
              'Key3': ['chr10', 19010502, 19014641],
              'Key4': ['chr10', 37375766, 37377526],
              'Key5': ['chr10', 76310389, 76315990, 76312224, 76312963],
              'Key6': ['chr11', 14806147, 14814006]}

# The grouping requested in the question, kept for reference.
New_list = [
    {
        "Key1": ['chr10', 19010495, 19014590, 19014064],
        "Key2": ['chr10', 19010495, 19014658],
        "Key3": ['chr10', 19010502, 19014641]
    },
    {"Key4": ['chr10', 37375766, 37377526]},
    {"Key5": ['chr10', 76310389, 76315990, 76312224, 76312963]},
    {"Key6": ['chr11', 14806147, 14814006]}]

pprint(First_dict)
print('-'*40)
g = groups(sorted_mapper(First_dict))
p1 = reduce_phase1(groups(sorted_mapper(First_dict)))
p2 = reduce_phase2(p1)
pprint(p2)

Output

{'Key1': ['chr10', 19010495, 19014590, 19014064],
 'Key2': ['chr10', 19010495, 19014658],
 'Key3': ['chr10', 19010502, 19014641],
 'Key4': ['chr10', 37375766, 37377526],
 'Key5': ['chr10', 76310389, 76315990, 76312224, 76312963],
 'Key6': ['chr11', 14806147, 14814006]}
----------------------------------------
[{'Key6': ['chr11', 14806147, 14814006]},
 {'Key1': ['chr10', 19010495, 19014064, 19014590],
  'Key2': ['chr10', 19010495, 19014658],
  'Key3': ['chr10', 19010502, 19014641]},
 {'Key4': ['chr10', 37375766, 37377526]},
 {'Key5': ['chr10', 76310389, 76312224, 76312963, 76315990]}]

TLDR;

Mapper output

The mapper emits one record for every dictionary key and chromosome coordinate. Each record carries the range within which other coordinates can match it.

((el-1000, el+1000), (dict-key, chromosome), el)

(el-1000, el+1000) is the range within which any other chromosome coordinate can match.
(dict-key, chromosome) identifies the original dictionary entry for this coordinate.
el is one coordinate from the chromosome coordinate list.

((19009495, 19011495), ('Key1', 'chr10'), 19010495)
((19013590, 19015590), ('Key1', 'chr10'), 19014590)
((19013064, 19015064), ('Key1', 'chr10'), 19014064)
((19009495, 19011495), ('Key2', 'chr10'), 19010495)
((19013658, 19015658), ('Key2', 'chr10'), 19014658)
((19009502, 19011502), ('Key3', 'chr10'), 19010502)
((19013641, 19015641), ('Key3', 'chr10'), 19014641)
((37374766, 37376766), ('Key4', 'chr10'), 37375766)
((37376526, 37378526), ('Key4', 'chr10'), 37377526)
((76309389, 76311389), ('Key5', 'chr10'), 76310389)
((76314990, 76316990), ('Key5', 'chr10'), 76315990)
((76311224, 76313224), ('Key5', 'chr10'), 76312224)
((76311963, 76313963), ('Key5', 'chr10'), 76312963)
((14805147, 14807147), ('Key6', 'chr11'), 14806147)
((14813006, 14815006), ('Key6', 'chr11'), 14814006)

Note: the mapper output is not sorted.

Sort

We need to sort the transformed data using (el-1000, el+1000) as the key. This lets us check whether the next value falls within the range of the previous one. Because the keys are in sorted order, we can chain together the values that lie within the specified overlap.

((14805147, 14807147), ('Key6', 'chr11'), 14806147)
((14813006, 14815006), ('Key6', 'chr11'), 14814006)
((19009495, 19011495), ('Key1', 'chr10'), 19010495)
((19009495, 19011495), ('Key2', 'chr10'), 19010495)
((19009502, 19011502), ('Key3', 'chr10'), 19010502)
((19013064, 19015064), ('Key1', 'chr10'), 19014064)
((19013590, 19015590), ('Key1', 'chr10'), 19014590)
((19013641, 19015641), ('Key3', 'chr10'), 19014641)
((19013658, 19015658), ('Key2', 'chr10'), 19014658)
((37374766, 37376766), ('Key4', 'chr10'), 37375766)
((37376526, 37378526), ('Key4', 'chr10'), 37377526)
((76309389, 76311389), ('Key5', 'chr10'), 76310389)
((76311224, 76313224), ('Key5', 'chr10'), 76312224)
((76311963, 76313963), ('Key5', 'chr10'), 76312963)
((76314990, 76316990), ('Key5', 'chr10'), 76315990)

Groups

Group the values that fall within the specified overlap. Each list that emerges contains the coordinates that lie within the overlap range of the first coordinate in that list.

[((14805147, 14807147), ('Key6', 'chr11'), 14806147)]
----------------------------------------
[((14813006, 14815006), ('Key6', 'chr11'), 14814006)]
----------------------------------------
[((19009495, 19011495), ('Key1', 'chr10'), 19010495),
 ((19009495, 19011495), ('Key2', 'chr10'), 19010495),
 ((19009502, 19011502), ('Key3', 'chr10'), 19010502)]
----------------------------------------
[((19013064, 19015064), ('Key1', 'chr10'), 19014064),
 ((19013590, 19015590), ('Key1', 'chr10'), 19014590),
 ((19013641, 19015641), ('Key3', 'chr10'), 19014641),
 ((19013658, 19015658), ('Key2', 'chr10'), 19014658)]
----------------------------------------
[((37374766, 37376766), ('Key4', 'chr10'), 37375766)]
----------------------------------------
[((37376526, 37378526), ('Key4', 'chr10'), 37377526)]
----------------------------------------
[((76309389, 76311389), ('Key5', 'chr10'), 76310389)]
----------------------------------------
[((76311224, 76313224), ('Key5', 'chr10'), 76312224),
 ((76311963, 76313963), ('Key5', 'chr10'), 76312963)]
----------------------------------------
[((76314990, 76316990), ('Key5', 'chr10'), 76315990)]
----------------------------------------

Reduce - Phase 1

Clean up the data by removing the engineered features.

{'Key6': ['chr11', 14806147]}
----------------------------------------
{'Key6': ['chr11', 14814006]}
----------------------------------------
{'Key1': ['chr10', 19010495],
 'Key2': ['chr10', 19010495],
 'Key3': ['chr10', 19010502]}
----------------------------------------
{'Key1': ['chr10', 19014064, 19014590],
 'Key2': ['chr10', 19014658],
 'Key3': ['chr10', 19014641]}
----------------------------------------
{'Key4': ['chr10', 37375766]}
----------------------------------------
{'Key4': ['chr10', 37377526]}
----------------------------------------
{'Key5': ['chr10', 76310389]}
----------------------------------------
{'Key5': ['chr10', 76312224, 76312963]}
----------------------------------------
{'Key5': ['chr10', 76315990]}
----------------------------------------

Reduce - Phase 2

Aggregate the displaced dictionary keys back into their original dictionary. When dictionary keys match, append the coordinates for the corresponding chromosome.

{'Key6': ['chr11', 14806147, 14814006]}
----------------------------------------
{'Key1': ['chr10', 19010495, 19014064, 19014590],
 'Key2': ['chr10', 19010495, 19014658],
 'Key3': ['chr10', 19010502, 19014641]}
----------------------------------------
{'Key4': ['chr10', 37375766, 37377526]}
----------------------------------------
{'Key5': ['chr10', 76310389, 76312224, 76312963, 76315990]}
----------------------------------------
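
The answer opens by suggesting that a graph-based solution might fit better. One possible sketch of that idea, which is not the answer's own code, treats each key as a node, links two keys when their [min, max] spans on the same chromosome share at least min_overlap positions, and returns the connected components using a small union-find. The names span and group_overlapping are hypothetical, and the reading of "overlap" as shared span length is an assumption taken from the question's wording.

```python
from itertools import combinations
from pprint import pprint


def span(value):
    """Return (chromosome, lowest coordinate, highest coordinate)."""
    chrom, *coords = value
    return chrom, min(coords), max(coords)


def group_overlapping(d, min_overlap=1000):
    """Group keys whose spans share >= min_overlap positions on one chromosome."""
    parent = {key: key for key in d}

    def find(key):
        # Follow parent links to the representative of key's group.
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    spans = {key: span(value) for key, value in d.items()}
    for a, b in combinations(d, 2):
        chrom_a, lo_a, hi_a = spans[a]
        chrom_b, lo_b, hi_b = spans[b]
        shared = min(hi_a, hi_b) - max(lo_a, lo_b)
        if chrom_a == chrom_b and shared >= min_overlap:
            parent[find(b)] = find(a)  # union the two groups

    grouped = {}
    for key, value in d.items():  # keeps the input key order
        grouped.setdefault(find(key), {})[key] = value
    return list(grouped.values())


First_dict = {'Key1': ['chr10', 19010495, 19014590, 19014064],
              'Key2': ['chr10', 19010495, 19014658],
              'Key3': ['chr10', 19010502, 19014641],
              'Key4': ['chr10', 37375766, 37377526],
              'Key5': ['chr10', 76310389, 76315990, 76312224, 76312963],
              'Key6': ['chr11', 14806147, 14814006]}

pprint(group_overlapping(First_dict))
```

For the sample data this groups Key1, Key2 and Key3 together and leaves Key4, Key5 and Key6 on their own, with the original coordinate lists untouched. The pairwise loop is quadratic in the number of keys, so for large inputs sorting the spans per chromosome first would likely be a better starting point.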